18. KDD 2012:Beijing, China

The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 12-16, 2012. ACM 【DBLP Link

Paper Num: 206 || Session Num: 47

Keynote addresses 4

1. Nine real hard problems we'd like you to solve.

Paper Link】 【Pages】:1

【Authors】: Robin Li

【Abstract】: Since 2000, Baidu has set its mission as providing the best way for people to find what they're looking for. Today, the company has become the world's largest Chinese search engine. Everyday, we process billions of search queries and serve hundreds of millions of internet users. The huge volume of online text and multimedia content, as well as user log data, provide us unprecedented opportunities and challenges for further accomplishing our mission. In addition, we have seen several megatrends, cloud computing is becoming a pervasive service infrastructure, mobile internet will surpass traditional internet in user time spend, and social platform has demonstrated its great power. In response to all these opportunities, challenges, and megatrends, we must think ahead on the major technology focuses that may help Baidu to serve our users better. In this talk, I would like to share with the audience the nine areas that are the most important and interesting in my mind. For each of the areas, I will describe the challenges and explain why it is important and interesting. I hope the research community can get excited, and help us provide better services for users.

【Keywords】: baidu; search engine

2. Mining heterogeneous information networks: the next frontier.

Paper Link】 【Pages】:2-3

【Authors】: Jaiwei Han

【Abstract】: Real world physical and abstract data objects are interconnected, forming gigantic, interconnected networks. By structuring these data objects into multiple types, such networks become semi-structured heterogeneous information networks. Most real world applications that handle big data, including interconnected social media and social networks, scientific, engineering, or medical information systems, online e-commerce systems, and most database systems, can be structured into heterogeneous information networks. For example, in a medical care network, objects of multiple types, such as patients, doctors, diseases, medication, and links such as visits, diagnosis, and treatments are intertwined together, providing rich information and forming heterogeneous information networks. Effective analysis of large-scale heterogeneous information networks poses an interesting but critical challenge. In this talk, we present a set of data mining scenarios in heterogeneous information networks and show that mining heterogeneous information networks is a new and promising research frontier in data mining research. Departing from many existing network models that view data as homogeneous graphs or networks, the semi-structured heterogeneous information network model leverages the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data. This heterogeneous network modeling will lead to the discovery of a set of new principles and methodologies for mining interconnected data. The examples to be used in this discussion include (1) meta path-based similarity search, (2) rank-based clustering, (3) rank-based classification, (4) meta path-based link/relationship prediction, (5) relation strength-aware mining, as well as a few other recent developments. We will also point out some promising research directions and provide convincing arguments on that mining heterogeneous information networks is the next frontier in data mining.

【Keywords】: data mining; heterogeneous information networks

3. Divide-and-conquer and statistical inference for big data.

Paper Link】 【Pages】:4

【Authors】: Michael I. Jordan

【Abstract】: I present some recent work on statistical inference for Big Data. Divide-and-conquer is a natural computational paradigm for approaching Big Data problems, particularly given recent developments in distributed and parallel computing, but some interesting challenges arise when applying divide-and-conquer algorithms to statistical inference problems. One interesting issue is that of obtaining confidence intervals in massive datasets. The bootstrap principle suggests resampling data to obtain fluctuations in the values of estimators, and thereby confidence intervals, but this is infeasible with massive data. Subsampling the data yields fluctuations on the wrong scale, which have to be corrected to provide calibrated statistical inferences. I present a new procedure, the "bag of little bootstraps," which circumvents this problem, inheriting the favorable theoretical properties of the bootstrap but also having a much more favorable computational profile. Another issue that I discuss is the problem of large-scale matrix completion. Here divide-and-conquer is a natural heuristic that works well in practice, but new theoretical problems arise when attempting to characterize the statistical performance of divide-and-conquer algorithms. Here the theoretical support is provided by concentration theorems for random matrices, and I present a new approach to this problem based on Stein's method1.

【Keywords】: big data; confidence intervals; divide-and-conquer; subsampling

4. Experiments in social computation: (and the data they generate).

Paper Link】 【Pages】:5

【Authors】: Michael Kearns

【Abstract】: For a number of years we have been conducting controlled human-subject experiments in distributed social computation in networks with only limited and local communication. These experiments cast a number of traditional computational, economic and sociological problems (including graph coloring, consensus, independent set, networked bargaining, biased voting and network formation) as games of strategic interaction in which subjects have financial incentives to collectively "compute" global solutions. I will overview and summarize the many behavioral findings from this line of experimentation. I will give particular emphasis to the novel data the experiments have generated, and the analyses this data has permitted, including quantitative studies of subject "personality" traits such as stubbornness, altruism, and patience, and whether those traits seem helpful or harmful to individual and collective performance.

【Keywords】: experiments; social network

Research session a1: page rank and social networks 5

5. Rise and fall patterns of information diffusion: model and implications.

Paper Link】 【Pages】:6-14

【Authors】: Yasuko Matsubara ; Yasushi Sakurai ; B. Aditya Prakash ; Lei Li ; Christos Faloutsos

【Abstract】: The recent explosion in the adoption of search engines and new media such as blogs and Twitter have facilitated faster propagation of news and rumors. How quickly does a piece of news spread over these media? How does its popularity diminish over time? Does the rising and falling pattern follow a simple universal law? In this paper, we propose SpikeM, a concise yet flexible analytical model for the rise and fall patterns of influence propagation. Our model has the following advantages: (a) unification power: it generalizes and explains earlier theoretical models and empirical observations; (b) practicality: it matches the observed behavior of diverse sets of real data; (c) parsimony: it requires only a handful of parameters; and (d) usefulness: it enables further analytics tasks such as fore- casting, spotting anomalies, and interpretation by reverse- engineering the system parameters of interest (e.g. quality of news, count of interested bloggers, etc.). Using SpikeM, we analyzed 7.2GB of real data, most of which were collected from the public domain. We have shown that our SpikeM model accurately and succinctly describes all the patterns of the rise-and-fall spikes in these real datasets.

【Keywords】: information diffusion; social networks

6. Efficient personalized pagerank with accuracy assurance.

Paper Link】 【Pages】:15-23

【Authors】: Yasuhiro Fujiwara ; Makoto Nakatsuji ; Takeshi Yamamuro ; Hiroaki Shiokawa ; Makoto Onizuka

【Abstract】: Personalize PageRank (PPR) is an effective relevance (proximity) measure in graph mining. The goal of this paper is to efficiently compute single node relevance and top-k/highly relevant nodes without iteratively computing the relevances of all nodes. Based on a "random surfer model", PPR iteratively computes the relevances of all nodes in a graph until convergence for a given user preference distribution. The problem with this iterative approach is that it cannot compute the relevance of just one or a few nodes. The heart of our solution is to compute single node relevance accurately in non-iterative manner based on sparse matrix representation, and to compute top-k/highly relevant nodes exactly by pruning unnecessary relevance computations based on upper/lower relevance estimations. Our experiments show that our approach is up to seven orders of magnitude faster than the existing alternatives.

【Keywords】: personalized pagerank; relevance computation

7. PageRank on an evolving graph.

Paper Link】 【Pages】:24-32

【Authors】: Bahman Bahmani ; Ravi Kumar ; Mohammad Mahdian ; Eli Upfal

【Abstract】: One of the most important features of the Web graph and social networks is that they are constantly evolving. The classical computational paradigm, which assumes a fixed data set as an input to an algorithm that terminates, is inadequate for such settings. In this paper we study the problem of computing PageRank on an evolving graph. We propose an algorithm that, at any moment in the time and by crawling a small portion of the graph, provides an estimate of the PageRank that is close to the true PageRank of the graph at that moment. We will also evaluate our algorithm experimentally on real data sets and on randomly generated inputs. Under a stylized model of graph evolution, we show that our algorithm achieves a provable performance guarantee that is significantly better than the naive algorithm that crawls the nodes in a round-robin fashion.

【Keywords】: dynamic graphs; pagerank; random walks

8. Information diffusion and external influence in networks.

Paper Link】 【Pages】:33-41

【Authors】: Seth A. Myers ; Chenguang Zhu ; Jure Leskovec

【Abstract】: Social networks play a fundamental role in the diffusion of information. However, there are two different ways of how information reaches a person in a network. Information reaches us through connections in our social networks, as well as through the influence external out-of-network sources, like the mainstream media. While most present models of information adoption in networks assume information only passes from a node to node via the edges of the underlying network, the recent availability of massive online social media data allows us to study this process in more detail. We present a model in which information can reach a node via the links of the social network or through the influence of external sources. We then develop an efficient model parameter fitting technique and apply the model to the emergence of URL mentions in the Twitter network. Using a complete one month trace of Twitter we study how information reaches the nodes of the network. We quantify the external influences over time and describe how these influences affect the information adoption. We discover that the information tends to "jump" across the network, which can only be explained as an effect of an unobservable external influence on the network. We find that only about 71% of the information volume in Twitter can be attributed to network diffusion, and the remaining 29% is due to external events and factors outside the network.

【Keywords】: diffusion of innovations; external influence; information cascades; information diffusion; social networks; twitter

9. The missing models: a data-driven approach for learning how networks grow.

Paper Link】 【Pages】:42-50

【Authors】: Robert Patro ; Geet Duggal ; Emre Sefer ; Hao Wang ; Darya Filippova ; Carl Kingsford

【Abstract】: Probabilistic models of network growth have been extensively studied as idealized representations of network evolution. Models, such as the Kronecker model, duplication-based models, and preferential attachment models, have been used for tasks such as representing null models, detecting anomalies, algorithm testing, and developing an understanding of various mechanistic growth processes. However, developing a new growth model to fit observed properties of a network is a difficult task, and as new networks are studied, new models must constantly be developed. Here, we present a framework, called GrowCode, for the automatic discovery of novel growth models that match user-specified topological features in undirected graphs. GrowCode introduces a set of basic commands that are general enough to encode several previously developed models. Coupling this formal representation with an optimization approach, we show that GrowCode is able to discover models for protein interaction networks, autonomous systems networks, and scientific collaboration networks that closely match properties such as the degree distribution, the clustering coefficient, and assortativity that are observed in real networks of these classes. Additional tests on simulated networks show that the models learned by GrowCode generate distributions of graphs with similar variance as existing models for these classes.

【Keywords】: algorithms; graph mining; network growth models

Research session a2: pattern mining 5

10. Finding minimum representative pattern sets.

Paper Link】 【Pages】:51-59

【Authors】: Guimei Liu ; Haojun Zhang ; Limsoon Wong

【Abstract】: Frequent pattern mining often produces an enormous number of frequent patterns, which imposes a great challenge on understanding and further analysis of the generated patterns. This calls for finding a small number of representative patterns to best approximate all other patterns. An ideal approach should 1) produce a minimum number of representative patterns; 2) restore the support of all patterns with error guarantee; and 3) have good efficiency. Few existing approaches can satisfy all the three requirements. In this paper, we develop two algorithms, MinRPset and FlexRPset, for finding minimum representative pattern sets. Both algorithms provide error guarantee. MinRPset produces the smallest solution that we can possibly have in practice under the given problem setting, and it takes a reasonable amount of time to finish. FlexRPset is developed based on MinRPset. It provides one extra parameter K to allow users to make a trade-off between result size and efficiency. Our experiment results show that MinRPset and FlexRPset produce fewer representative patterns than RPlocal---an efficient algorithm that is developed for solving the same problem. FlexRPset can be slightly faster than RPlocal when K is small.

【Keywords】: frequent pattern summarization; representative patterns

11. Mining emerging patterns by streaming feature selection.

Paper Link】 【Pages】:60-68

【Authors】: Kui Yu ; Wei Ding ; Dan A. Simovici ; Xindong Wu

【Abstract】: Building an accurate emerging pattern classifier with a high-dimensional dataset is a challenging issue. The problem becomes even more difficult if the whole feature space is unavailable before learning starts. This paper presents a new technique on mining emerging patterns using streaming feature selection. We model high feature dimensions with streaming features, that is, features arrive and are processed one at a time. As features flow in one by one, we online evaluate each coming feature to determine whether it is useful for mining predictive emerging patterns (EPs) by exploiting the relationship between feature relevance and EP discriminability (the predictive ability of an EP). We employ this relationship to guide an online EP mining process. This new approach can mine EPs from a high-dimensional dataset, even when its entire feature set is unavailable before learning. The experiments on a broad range of datasets validate the effectiveness of the proposed approach against other well-established methods, in terms of predictive accuracy, pattern numbers and running time.

【Keywords】: emerging patterns; feature relevance; streaming features

12. Linear space direct pattern sampling using coupling from the past.

Paper Link】 【Pages】:69-77

【Authors】: Mario Boley ; Sandy Moens ; Thomas Gärtner

【Abstract】: This paper shows how coupling from the past (CFTP) can be used to avoid time and memory bottlenecks in direct local pattern sampling procedures. Such procedures draw controlled amounts of suitably biased samples directly from the pattern space of a given dataset in polynomial time. Previous direct pattern sampling methods can produce patterns in rapid succession after some initial preprocessing phase. This preprocessing phase, however, turns out to be prohibitive in terms of time and memory for many datasets. We show how CFTP can be used to avoid any super-linear preprocessing and memory requirements. This allows to simulate more complex distributions, which previously were intractable. We show for a large number of public real-world datasets that these new algorithms are fast to execute and their pattern collections outperform previous approaches both in unsupervised as well as supervised contexts.

【Keywords】: cftp; frequent sets; local patterns; sampling

13. Mining top-K high utility itemsets.

Paper Link】 【Pages】:78-86

【Authors】: Cheng-Wei Wu ; Bai-En Shie ; Vincent S. Tseng ; Philip S. Yu

【Abstract】: Mining high utility itemsets from databases is an emerging topic in data mining, which refers to the discovery of itemsets with utilities higher than a user-specified minimum utility threshold min_util. Although several studies have been carried out on this topic, setting an appropriate minimum utility threshold is a difficult problem for users. If min_util is set too low, too many high utility itemsets will be generated, which may cause the mining algorithms to become inefficient or even run out of memory. On the other hand, if min_util is set too high, no high utility itemset will be found. Setting appropriate minimum utility thresholds by trial and error is a tedious process for users. In this paper, we address this problem by proposing a new framework named top-k high utility itemset mining, where k is the desired number of high utility itemsets to be mined. An efficient algorithm named TKU (Top-K Utility itemsets mining) is proposed for mining such itemsets without setting min_util. Several features were designed in TKU to solve the new challenges raised in this problem, like the absence of anti-monotone property and the requirement of lossless results. Moreover, TKU incorporates several novel strategies for pruning the search space to achieve high efficiency. Results on real and synthetic datasets show that TKU has excellent performance and scalability.

【Keywords】: high utility itemset; top-k pattern mining; utility mining

14. Sampling minimal frequent boolean (DNF) patterns.

Paper Link】 【Pages】:87-95

【Authors】: Geng Li ; Mohammed J. Zaki

【Abstract】: We tackle the challenging problem of mining the simplest Boolean patterns from categorical datasets. Instead of complete enumeration, which is typically infeasible for this class of patterns, we develop effective sampling methods to extract a representative subset of the minimal Boolean patterns (in disjunctive normal form - DNF). We make both theoretical and practical contributions, which allow us to prune the search space based on provable properties. Our approach can provide a near-uniform sample of the minimal DNF patterns. We also show that the mined minimal DNF patterns are very effective when used as features for classification.

【Keywords】: boolean expression patterns; minimal generator; pattern-based classification; sampling

Research session a3: probabilistic models 5

15. The contextual focused topic model.

Paper Link】 【Pages】:96-104

【Authors】: Xu Chen ; Mingyuan Zhou ; Lawrence Carin

【Abstract】: A nonparametric Bayesian contextual focused topic model (cFTM) is proposed. The cFTM infers a sparse ("focused") set of topics for each document, while also leveraging contextual information about the author(s) and document venue. The hierarchical beta process, coupled with a Bernoulli process, is employed to infer the focused set of topics associated with each author and venue; the same construction is also employed to infer those topics associated with a given document that are unusual (termed "random effects"), relative to topics that are inferred as probable for the associated author(s) and venue. To leverage statistical strength and infer latent interrelationships between authors and venues, the Dirichlet process is utilized to cluster authors and venues. The cFTM automatically infers the number of topics needed to represent the corpus, the number of author and venue clusters, and the probabilistic importance of the author, venue and random-effect information on word assignment for a given document. Efficient MCMC inference is presented. Example results and interpretations are presented for two real datasets, demonstrating promising performance, with comparison to other state-of-the-art methods.

【Keywords】: bayesian nonparametric; clustering; dirich-let process; hierarchical beta process; topic modeling

16. Practical collapsed variational bayes inference for hierarchical dirichlet process.

Paper Link】 【Pages】:105-113

【Authors】: Issei Sato ; Kenichi Kurihara ; Hiroshi Nakagawa

【Abstract】: We propose a novel collapsed variational Bayes (CVB) inference for the hierarchical Dirichlet process (HDP). While the existing CVB inference for the HDP variant of latent Dirichlet allocation (LDA) is more complicated and harder to implement than that for LDA, the proposed algorithm is simple to implement, does not require variance counts to be maintained, does not need to set hyper-parameters, and has good predictive performance.

【Keywords】: collapsed variational bayes inference; hierarchical dirichlet process; latent dirichlet allocation; nonparametric bayes

17. Overlapping decomposition for causal graphical modeling.

Paper Link】 【Pages】:114-122

【Authors】: Lei Han ; Guojie Song ; Gao Cong ; Kunqing Xie

【Abstract】: Causal graphical models are developed to detect the dependence relationships between random variables and provide intuitive explanations for the relationships in complex systems. Most of existing work focuses on learning a single graphical model for all the variables. However, a single graphical model cannot accurately characterize the complicated causal relationships for a relatively large graph. In this paper, we propose the problem of estimating an overlapping decomposition for Gaussian graphical models of a large scale to generate overlapping sub-graphical models. Specifically, we formulate an objective function for the overlapping decomposition problem and propose an approximate algorithm for it. A key theory of the algorithm is that the problem of solving a κ+1 node graphical model can be reduced to the problem of solving a one-step regularization based on a solved κ node graphical model. Based on this theory, a greedy expansion algorithm is proposed to generate the overlapping subgraphs. We evaluate the effectiveness of our model on both synthetic datasets and real traffic dataset, and the experimental results show the superiority of our method.

【Keywords】: causality; graphical model; overlapping decomposition

18. TM-LDA: efficient online modeling of latent topic transitions in social media.

Paper Link】 【Pages】:123-131

【Authors】: Yu Wang ; Eugene Agichtein ; Michele Benzi

【Abstract】: Latent topic analysis has emerged as one of the most effective methods for classifying, clustering and retrieving textual data. However, existing models such as Latent Dirichlet Allocation (LDA) were developed for static corpora of relatively large documents. In contrast, much of the textual content on the web, and especially social media, is temporally sequenced, and comes in short fragments, including microblog posts on sites such as Twitter and Weibo, status updates on social networking sites such as Facebook and LinkedIn, or comments on content sharing sites such as YouTube. In this paper we propose a novel topic model, Temporal-LDA or TM-LDA, for efficiently mining text streams such as a sequence of posts from the same author, by modeling the topic transitions that naturally arise in these data. TM-LDA learns the transition parameters among topics by minimizing the prediction error on topic distribution in subsequent postings. After training, TM-LDA is thus able to accurately predict the expected topic distribution in future posts. To make these predictions more efficient for a realistic online setting, we develop an efficient updating algorithm to adjust the topic transition parameters, as new documents stream in. Our empirical results, over a corpus of over 30 million microblog posts, show that TM-LDA significantly outperforms state-of-the-art static LDA models for estimating the topic distribution of new documents over time. We also demonstrate that TM-LDA is able to highlight interesting variations of common topic transitions, such as the differences in the work-life rhythm of cities, and factors associated with area-specific problems and complaints.

【Keywords】: mining social media data; temporal language models; topic transition modeling

19. Multi-view clustering using mixture models in subspace projections.

Paper Link】 【Pages】:132-140

【Authors】: Stephan Günnemann ; Ines Färber ; Thomas Seidl

【Abstract】: Detecting multiple clustering solutions is an emerging research field. While data is often multi-faceted in its very nature, traditional clustering methods are restricted to find just a single grouping. To overcome this limitation, methods aiming at the detection of alternative and multiple clustering solutions have been proposed. In this work, we present a Bayesian framework to tackle the problem of multi-view clustering. We provide multiple generalizations of the data by using multiple mixture models. Each mixture describes a specific view on the data by using a mixture of Beta distributions in subspace projections. Since a mixture summarizes the clusters located in similar subspace projections, each view highlights specific aspects of the data. In addition, our model handles overlapping views, where the mixture components compete against each other in the data generation process. For efficiently learning the distributions, we propose the algorithm MVGen that exploits the ICM principle and uses Bayesian model selection to trade-off the cluster model's complexity against its goodness of fit. With experiments on various real-world data sets, we demonstrate the high potential of MVGen to detect multiple, overlapping clustering views in subspace projections of the data.

【Keywords】: generative model; graphical model; model-based clustering

Research session a4: supervised learning 5

20. A simple methodology for soft cost-sensitive classification.

Paper Link】 【Pages】:141-149

【Authors】: Te-Kang Jan ; Da-Wei Wang ; Chi-Hung Lin ; Hsuan-Tien Lin

【Abstract】: Many real-world data mining applications need varying cost for different types of classification errors and thus call for cost-sensitive classification algorithms. Existing algorithms for cost-sensitive classification are successful in terms of minimizing the cost, but can result in a high error rate as the trade-off. The high error rate holds back the practical use of those algorithms. In this paper, we propose a novel cost-sensitive classification methodology that takes both the cost and the error rate into account. The methodology, called soft cost-sensitive classification, is established from a multicriteria optimization problem of the cost and the error rate, and can be viewed as regularizing cost-sensitive classification with the error rate. The simple methodology allows immediate improvements of existing cost-sensitive classification algorithms. Experiments on the benchmark and the real-world data sets show that our proposed methodology indeed achieves lower test error rates and similar (sometimes lower) test costs than existing cost-sensitive classification algorithms.

【Keywords】: classification; cost-sensitive learning; multicriteria optimization; regularization

21. Intelligible models for classification and regression.

Paper Link】 【Pages】:150-158

【Authors】: Yin Lou ; Rich Caruana ; Johannes Gehrke

【Abstract】: Complex models for regression and classification have high accuracy, but are unfortunately no longer interpretable by users. We study the performance of generalized additive models (GAMs), which combine single-feature models called shape functions through a linear function. Since the shape functions can be arbitrarily complex, GAMs are more accurate than simple linear models. But since they do not contain any interactions between features, they can be easily interpreted by users. We present the first large-scale empirical comparison of existing methods for learning GAMs. Our study includes existing spline and tree-based methods for shape functions and penalized least squares, gradient boosting, and backfitting for learning GAMs. We also present a new method based on tree ensembles with an adaptive number of leaves that consistently outperforms previous work. We complement our experimental results with a bias-variance analysis that explains how different shape models influence the additive model. Our experiments show that shallow bagged trees with gradient boosting distinguish itself as the best method on low- to medium-dimensional datasets.

【Keywords】: classification; intelligible models; regression

22. NASA: achieving lower regrets and faster rates via adaptive stepsizes.

Paper Link】 【Pages】:159-167

【Authors】: Hua Ouyang ; Alexander G. Gray

【Abstract】: The classic Stochastic Approximation (SA) method achieves optimal rates under the black-box model. This optimality does not rule out better algorithms when more information about functions and data is available. We present a family of Noise Adaptive Stochastic Approximation (NASA) algorithms for online convex optimization and stochastic convex optimization. NASA is an adaptive variant of Mirror Descent Stochastic Approximation. It is novel in its practical variation-dependent stepsizes and better theoretical guarantees. We show that comparing with state-of-the-art adaptive and non-adaptive SA methods, lower regrets and faster rates can be achieved under low-variation assumptions.

【Keywords】: adaptive learning; online convex optimization; online learning; stochastic optimization

23. Learning in non-stationary environments with class imbalance.

Paper Link】 【Pages】:168-176

【Authors】: Thomas Ryan Hoens ; Nitesh V. Chawla

【Abstract】: Learning in non-stationary environments is an increasingly important problem in a wide variety of real-world applications. In non-stationary environments data arrives incrementally, however the underlying generating function may change over time. In addition to the environments being non-stationary, they also often exhibit class imbalance. That is one class (the majority class) vastly outnumbers the other class (the minority class). This combination of class imbalance with non-stationary environments poses significant and interesting practical problems for classification. To overcome these issues, we introduce a novel instance selection mechanism, as well as provide a modification to the Heuristic Updatable Weighted Random Subspaces (HUWRS) method for the class imbalance problem. We then compare our modifications of HUWRS (called HUWRS.IP) to other state of the art algorithms, concluding that HUWRS. IP often achieves vastly superior performance.

【Keywords】: class imbalance; concept drift; non-stationary learning

24. Linear support vector machines via dual cached loops.

Paper Link】 【Pages】:177-185

【Authors】: Shin Matsushima ; S. V. N. Vishwanathan ; Alexander J. Smola

【Abstract】: Modern computer hardware offers an elaborate hierarchy of storage subsystems with different speeds, capacities, and costs associated with them. Furthermore, processors are now inherently parallel offering the execution of several diverse threads simultaneously. This paper proposes StreamSVM, the first algorithm for training linear Support Vector Machines (SVMs) which takes advantage of these properties by integrating caching with optimization. StreamSVM works by performing updates in the dual, thus obviating the need to rebalance frequently visited examples. Furthermore we trade off file I/O with data expansion on the fly by generating features on demand. This significantly increases throughput. Experiments show that StreamSVM outperforms other linear SVM solvers, including the award winning work of [38], by orders of magnitude and produces more accurate solutions within a shorter amount of time.

【Keywords】: coordinate descent; optimization; support vector machines

Industry/govt track a5: mobile computing 4

25. Discovering regions of different functions in a city using human mobility and POIs.

Paper Link】 【Pages】:186-194

【Authors】: Jing Yuan ; Yu Zheng ; Xing Xie

【Abstract】: The development of a city gradually fosters different functional regions, such as educational areas and business districts. In this paper, we propose a framework (titled DRoF) that Discovers Regions of different Functions in a city using both human mobility among regions and points of interests (POIs) located in a region. Specifically, we segment a city into disjointed regions according to major roads, such as highways and urban express ways. We infer the functions of each region using a topic-based inference model, which regards a region as a document, a function as a topic, categories of POIs (e.g., restaurants and shopping malls) as metadata (like authors, affiliations, and key words), and human mobility patterns (when people reach/leave a region and where people come from and leave for) as words. As a result, a region is represented by a distribution of functions, and a function is featured by a distribution of mobility patterns. We further identify the intensity of each function in different locations. The results generated by our framework can benefit a variety of applications, including urban planning, location choosing for a business, and social recommendations. We evaluated our method using large-scale and real-world datasets, consisting of two POI datasets of Beijing (in 2010 and 2011) and two 3-month GPS trajectory datasets (representing human mobility) generated by over 12,000 taxicabs in Beijing in 2010 and 2011 respectively. The results justify the advantages of our approach over baseline methods solely using POIs or human mobility.

【Keywords】: functional regions; human mobility; taxi trajectories; urban computing

Paper Link】 【Pages】:195-203

【Authors】: Ling-Yin Wei ; Yu Zheng ; Wen-Chih Peng

【Abstract】: The advances in location-acquisition technologies have led to a myriad of spatial trajectories. These trajectories are usually generated at a low or an irregular frequency due to applications' characteristics or energy saving, leaving the routes between two consecutive points of a single trajectory uncertain (called an uncertain trajectory). In this paper, we present a Route Inference framework based on Collective Knowledge (abbreviated as RICK) to construct the popular routes from uncertain trajectories. Explicitly, given a location sequence and a time span, the RICK is able to construct the top-k routes which sequentially pass through the locations within the specified time span, by aggregating such uncertain trajectories in a mutual reinforcement way (i.e., uncertain + uncertain → certain). Our work can benefit trip planning, traffic management, and animal movement studies. The RICK comprises two components: routable graph construction and route inference. First, we explore the spatial and temporal characteristics of uncertain trajectories and construct a routable graph by collaborative learning among the uncertain trajectories. Second, in light of the routable graph, we propose a routing algorithm to construct the top-k routes according to a user-specified query. We have conducted extensive experiments on two real datasets, consisting of Foursquare check-in datasets and taxi trajectories. The results show that RICK is both effective and efficient.

【Keywords】: collaborative learning; route inference; social media; trajectory data mining

27. GetJar mobile application recommendations with very sparse datasets.

Paper Link】 【Pages】:204-212

【Authors】: Kent Shi ; Kamal Ali

【Abstract】: The Netflix competition of 2006 [2] has spurred significant activity in the recommendations field, particularly in approaches using latent factor models [3,5,8,12] However, the near ubiquity of the Netflix and the similar MovieLens datasets1 may be narrowing the generality of lessons learned in this field. At GetJar, our goal is to make appealing recommendations of mobile applications (apps). For app usage, we observe a distribution that has higher kurtosis (heavier head and longer tail) than that for the aforementioned movie datasets. This happens primarily because of the large disparity in resources available to app developers and the low cost of app publication relative to movies. In this paper we compare a latent factor (PureSVD) and a memory-based model with our novel PCA-based model, which we call Eigenapp. We use both accuracy and variety as evaluation metrics. PureSVD did not perform well due to its reliance on explicit feedback such as ratings, which we do not have. Memory-based approaches that perform vector operations in the original high dimensional space over-predict popular apps because they fail to capture the neighborhood of less popular apps. They have high accuracy due to the concentration of mass in the head, but did poorly in terms of variety of apps exposed. Eigenapp, which exploits neighborhood information in low dimensional spaces, did well both on precision and variety, underscoring the importance of dimensionality reduction to form quality neighborhoods in high kurtosis distributions.

【Keywords】: PCA; evaluation; mobile application; recommender system; sparse data

28. Differentially private transit data publication: a case study on the montreal transportation system.

Paper Link】 【Pages】:213-221

【Authors】: Rui Chen ; Benjamin C. M. Fung ; Bipin C. Desai ; Nériah M. Sossou

【Abstract】: With the wide deployment of smart card automated fare collection (SCAFC) systems, public transit agencies have been benefiting from huge volume of transit data, a kind of sequential data, collected every day. Yet, improper publishing and use of transit data could jeopardize passengers' privacy. In this paper, we present our solution to transit data publication under the rigorous differential privacy model for the Société de transport de Montréal (STM). We propose an efficient data-dependent yet differentially private transit data sanitization approach based on a hybrid-granularity prefix tree structure. Moreover, as a post-processing step, we make use of the inherent consistency constraints of a prefix tree to conduct constrained inferences, which lead to better utility. Our solution not only applies to general sequential data, but also can be seamlessly extended to trajectory data. To our best knowledge, this is the first paper to introduce a practical solution for publishing large volume of sequential data under differential privacy. We examine data utility in terms of two popular data analysis tasks conducted at the STM, namely count queries and frequent sequential pattern mining. Extensive experiments on real-life STM datasets confirm that our approach maintains high utility and is scalable to large datasets.

【Keywords】: data mining; differential privacy; non-interactive release; transit data

Asia-Pacific track a6: session 1 4

29. Interaction and collective intelligence in internet computing.

Paper Link】 【Pages】:222

【Authors】: Deyi Li

【Abstract】: Network interconnection, information interoperability, and crowds interaction on the Internet could inspire better computation models than Turing machine, since that human plays an important factor in Internet computing, so that the human-machine and machine-machine interactions have evolved to be the kernel of Internet computing. Internet has not been simply equivalent to a virtual huge computer, or a set of computers. On the Internet, human's behaviors are uncertain, the interactions and influence among people are also uncertain. These uncertainties cannot be described by Turing machine and traditional interaction machine. As a new computation platform, Internet computing requires new theories and methods. By combining topology in mathematics with the field theory in physics, we propose the topological potential approach, which set up a virtual field by the topological space to reflect individual activities, local effects and preferential attachment. This approach can be used to research the emergence of collective intelligence. Here, we introduce three case studies to illustrate the analysis on the collective intelligence on the Internet and discuss some potential applications of the topological potential approach.

【Keywords】: collaborative intelligence; crowd computing

30. Building an engine for big data.

Paper Link】 【Pages】:223

【Authors】: Masaru Kitsuregawa

【Abstract】: IT program in Japan to build powerful engine for big data was launched. Quite recently the initial version is commercialized. This presentation will give a brief overview of the project. Also some of the potential applications will be introduced.

【Keywords】: big data; search engine

31. A new challenge of information processing under the 21st century.

Paper Link】 【Pages】:224

【Authors】: Bo Zhang

【Abstract】: In web era, we are confronted with a huge amount of raw data and a tremendous change of man-machine interaction modes. We have to deal with the content (semantics) of data rather than their form alone. Traditional information processing approaches face a new challenge since they cannot deal with the semantic meaning or content of information. But humans can handle such a problem easily. So it's needed a new information processing strategy that correlated with the content of information by learning some mechanisms from human beings. Therefore, we need (1) a set of robust detectors for detecting semantically meaningful features such as boundaries, shapes, etc. in images, words, sentences, etc. in text, and (2) a set of methods that can effectively analyze and exploit the information structures that encode the content of information. During the past 40 years the probability theory has made a great progress. It has provided a set of mathematical tools for representing and analyzing information structures. In the talk we will discuss what difficulty we face, what we can do, and how we should do in the content-based information processing.

【Keywords】: content-based information processing; semantic web

32. Developing data mining applications.

Paper Link】 【Pages】:225

【Authors】: Geoff Holmes

【Abstract】: In this talk I will review several real-world applications developed at the University of Waikato over the past 15 years. These include the use of near infrared spectroscopy coupled with data mining as an alternate laboratory technique for predicting compound concentrations in soil and plant samples, and the analysis of gas chromatography mass spectrometry (GCMS) data, a technique used to determine in environmental applications, for example, the petroleum content in soil and water samples. I will then briefly discuss how experience with these applications has led to the development of an open-source framework for application development.

【Keywords】: data mining application; environmental science

Research session b1: social opinions 4

33. Learning from crowds in the presence of schools of thought.

Paper Link】 【Pages】:226-234

【Authors】: Yuandong Tian ; Jun Zhu

【Abstract】: Crowdsourcing has recently become popular among machine learning researchers and social scientists as an effective way to collect large-scale experimental data from distributed workers. To extract useful information from the cheap but potentially unreliable answers to tasks, a key problem is to identify reliable workers as well as unambiguous tasks. Although for objective tasks that have one correct answer per task, previous works can estimate worker reliability and task clarity based on the single gold standard assumption, for tasks that are subjective and accept multiple reasonable answers that workers may be grouped into, a phenomenon called schools of thought, existing models cannot be trivially applied. In this work, we present a statistical model to estimate worker reliability and task clarity without resorting to the single gold standard assumption. This is instantiated by explicitly characterizing the grouping behavior to form schools of thought with a rank-1 factorization of a worker-task groupsize matrix. Instead of performing an intermediate inference step, which can be expensive and unstable, we present an algorithm to analytically compute the sizes of different groups. We perform extensive empirical studies on real data collected from Amazon Mechanical Turk. Our method discovers the schools of thought, shows reasonable estimation of worker reliability and task clarity, and is robust to hyperparameter changes. Furthermore, our estimated worker reliability can be used to improve the gold standard prediction for objective tasks.

【Keywords】: crowdsourcing; pattern analysis; schools of thought

34. Social sampling.

Paper Link】 【Pages】:235-243

【Authors】: Anirban Dasgupta ; Ravi Kumar ; D. Sivakumar

【Abstract】: We investigate a class of methods that we call "social sampling," where participants in a poll respond with a summary of their friends' putative responses to the poll. Social sampling leads to a novel trade-off question: the savings in the number of samples(roughly the average degree of the network of participants) vs. the systematic bias in the poll due to the network structure. We provide precise analyses of estimators that result from this idea. With non-uniform sampling of nodes and non-uniform weighting of neighbors' responses, we devise an ideal unbiased estimator. We show that the variance of this estimator is controlled by the second eigenvalue of the normalized Laplacian of the network (the network structure penalty) and the correlation between node degrees and the property being measured (the effective savings factor). In addition, we present a sequence of approximate estimators that are simpler or more realistic or both, and analyze their performance. Experiments on large real-world networks show that social sampling is a powerful paradigm in obtaining accurate estimates with very few samples. At the same time, our results urge caution in interpreting recent results about "expectation vs. intent polling".

【Keywords】: polling; social networks

35. From user comments to on-line conversations.

Paper Link】 【Pages】:244-252

【Authors】: Chunyan Wang ; Mao Ye ; Bernardo A. Huberman

【Abstract】: We present an analysis of user conversations in on-line social media and their evolution over time. We propose a dynamic model that predicts the growth dynamics and structural properties of conversation threads. The model reconciles the differing observations that have been reported in existing studies. By separating artificial factors from user behavior, we show that there are actually underlying rules in common for on-line conversations in different social media websites. Results of our model are supported by empirical measurements throughout a number of different social media websites.

【Keywords】: conversation dynamics; social networks

36. eTrust: understanding trust evolution in an online world.

Paper Link】 【Pages】:253-261

【Authors】: Jiliang Tang ; Huiji Gao ; Huan Liu ; Atish Das Sarma

【Abstract】: Most existing research about online trust assumes static trust relations between users. As we are informed by social sciences, trust evolves as humans interact. Little work exists studying trust evolution in an online world. Researching online trust evolution faces unique challenges because more often than not, available data is from passive observation. In this paper, we leverage social science theories to develop a methodology that enables the study of online trust evolution. In particular, we propose a framework of evolution trust, eTrust, which exploits the dynamics of user preferences in the context of online product review. We present technical details about modeling trust evolution, and perform experiments to show how the exploitation of trust evolution can help improve the performance of online applications such as rating and trust prediction.

【Keywords】: multi-faceted trust; trust evolution; user preference

Research session b2: time series 4

37. Searching and mining trillions of time series subsequences under dynamic time warping.

Paper Link】 【Pages】:262-270

【Authors】: Thanawin Rakthanmanon ; Bilson J. L. Campana ; Abdullah Mueen ; Gustavo E. A. P. A. Batista ; M. Brandon Westover ; Qiang Zhu ; Jesin Zakaria ; Eamonn J. Keogh

【Abstract】: Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few millions of time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact; in large datasets we can exactly search under DTW much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We show that our ideas allow us to solve higher-level time series data mining problem such as motif discovery and clustering at scales that would otherwise be untenable. In addition to mining massive datasets, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.

【Keywords】: lower bounds; similarity search; time series

38. Fast mining and forecasting of complex time-stamped events.

Paper Link】 【Pages】:271-279

【Authors】: Yasuko Matsubara ; Yasushi Sakurai ; Christos Faloutsos ; Tomoharu Iwata ; Masatoshi Yoshikawa

【Abstract】: Given huge collections of time-evolving events such as web-click logs, which consist of multiple attributes (e.g., URL, userID, times- tamp), how do we find patterns and trends? How do we go about capturing daily patterns and forecasting future events? We need two properties: (a) effectiveness, that is, the patterns should help us understand the data, discover groups, and enable forecasting, and (b) scalability, that is, the method should be linear with the data size. We introduce TriMine, which performs three-way mining for all three attributes, namely, URLs, users, and time. Specifically TriMine discovers hidden topics, groups of URLs, and groups of users, simultaneously. Thanks to its concise but effective summarization, it makes it possible to accomplish the most challenging and important task, namely, to forecast future events. Extensive experiments on real datasets demonstrate that TriMine discovers meaningful topics and makes long-range forecasts, which are notoriously difficult to achieve. In fact, TriMine consistently outperforms the best state-of-the-art existing methods in terms of accuracy and execution speed (up to 74x faster).

【Keywords】: forecasting; tensor analysis; time-stamped events; topic model

39. Mining recent temporal patterns for event detection in multivariate time series data.

Paper Link】 【Pages】:280-288

【Authors】: Iyad Batal ; Dmitriy Fradkin ; James H. Harrison Jr. ; Fabian Moerchen ; Milos Hauskrecht

【Abstract】: Improving the performance of classifiers using pattern mining techniques has been an active topic of data mining research. In this work we introduce the recent temporal pattern mining framework for finding predictive patterns for monitoring and event detection problems in complex multivariate time series data. This framework first converts time series into time-interval sequences of temporal abstractions. It then constructs more complex temporal patterns backwards in time using temporal operators. We apply our framework to health care data of 13,558 diabetic patients and show its benefits by efficiently finding useful patterns for detecting and diagnosing adverse medical conditions that are associated with diabetes.

【Keywords】: event detection; patient classification; temporal abstractions; temporal pattern mining; time-interval patterns

40. A shapelet transform for time series classification.

Paper Link】 【Pages】:289-297

【Authors】: Jason Lines ; Luke M. Davis ; Jon Hills ; Anthony Bagnall

【Abstract】: The problem of time series classification (TSC), where we consider any real-valued ordered data a time series, presents a specific machine learning challenge as the ordering of variables is often crucial in finding the best discriminating features. One of the most promising recent approaches is to find shapelets within a data set. A shapelet is a time series subsequence that is identified as being representative of class membership. The original research in this field embedded the procedure of finding shapelets within a decision tree. We propose disconnecting the process of finding shapelets from the classification algorithm by proposing a shapelet transformation. We describe a means of extracting the k best shapelets from a data set in a single pass, and then use these shapelets to transform data by calculating the distances from a series to each shapelet. We demonstrate that transformation into this new data space can improve classification accuracy, whilst retaining the explanatory power provided by shapelets.

【Keywords】: filter; shapelet; time series; transformation

Research session b3: matrices and tensors 4

41. Accelerated singular value thresholding for matrix completion.

Paper Link】 【Pages】:298-306

【Authors】: Yao Hu ; Debing Zhang ; Jun Liu ; Jieping Ye ; Xiaofei He

【Abstract】: Recovering a large matrix from a small subset of its entries is a challenging problem arising in many real world applications, such as recommender system and image in-painting. These problems can be formulated as a general matrix completion problem. The Singular Value Thresholding (SVT) algorithm is a simple and efficient first-order matrix completion method to recover the missing values when the original data matrix is of low rank. SVT has been applied successfully in many applications. However, SVT is computationally expensive when the size of the data matrix is large, which significantly limits its applicability. In this paper, we propose an Accelerated Singular Value Thresholding (ASVT) algorithm which improves the convergence rate from O(1/N) for SVT to O(1/N2), where N is the number of iterations during optimization. Specifically, the dual problem of the nuclear norm minimization problem is derived and an adaptive line search scheme is introduced to solve this dual problem. Consequently, the optimal solution of the primary problem can be readily obtained from that of the dual problem. We have conducted a series of experiments on a synthetic dataset, a distance matrix dataset and a large movie rating dataset. The experimental results have demonstrated the efficiency and effectiveness of the proposed algorithm.

【Keywords】: adaptive line search scheme; matrix completion; nesterov's method; singular value thresholding

42. Fast bregman divergence NMF using taylor expansion and coordinate descent.

Paper Link】 【Pages】:307-315

【Authors】: Liangda Li ; Guy Lebanon ; Haesun Park

【Abstract】: Non-negative matrix factorization (NMF) provides a lower rank approximation of a matrix. Due to nonnegativity imposed on the factors, it gives a latent structure that is often more physically meaningful than other lower rank approximations such as singular value decomposition (SVD). Most of the algorithms proposed in literature for NMF have been based on minimizing the Frobenius norm. This is partly due to the fact that the minimization problem based on the Frobenius norm provides much more flexibility in algebraic manipulation than other divergences. In this paper we propose a fast NMF algorithm that is applicable to general Bregman divergences. Through Taylor series expansion of the Bregman divergences, we reveal a relationship between Bregman divergences and Euclidean distance. This key relationship provides a new direction for NMF algorithms with general Bregman divergences when combined with the scalar block coordinate descent method. The proposed algorithm generalizes several recently proposed methods for computation of NMF with Bregman divergences and is computationally faster than existing alternatives. We demonstrate the effectiveness of our approach with experiments conducted on artificial as well as real world data.

【Keywords】: bregman divergences; euclidean distance; non-negative matrix factorization; taylor series expansion

43. GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries.

Paper Link】 【Pages】:316-324

【Authors】: U. Kang ; Evangelos E. Papalexakis ; Abhay Harpale ; Christos Faloutsos

【Abstract】: Many data are modeled as tensors, or multi dimensional arrays. Examples include the predicates (subject, verb, object) in knowledge bases, hyperlinks and anchor texts in the Web graphs, sensor streams (time, location, and type), social networks over time, and DBLP conference-author-keyword relations. Tensor decomposition is an important data mining tool with various applications including clustering, trend detection, and anomaly detection. However, current tensor decomposition algorithms are not scalable for large tensors with billions of sizes and hundreds millions of nonzeros: the largest tensor in the literature remains thousands of sizes and hundreds thousands of nonzeros. Consider a knowledge base tensor consisting of about 26 million noun-phrases. The intermediate data explosion problem, associated with naive implementations of tensor decomposition algorithms, would require the materialization and the storage of a matrix whose largest dimension would be ≈7 x 1014; this amounts to ~10 Petabytes, or equivalently a few data centers worth of storage, thereby rendering the tensor analysis of this knowledge base, in the naive way, practically impossible. In this paper, we propose GIGATENSOR, a scalable distributed algorithm for large scale tensor decomposition. GIGATENSOR exploits the sparseness of the real world tensors, and avoids the intermediate data explosion problem by carefully redesigning the tensor decomposition algorithm. Extensive experiments show that our proposed GIGATENSOR solves 100 times bigger problems than existing methods. Furthermore, we employ GIGATENSOR in order to analyze a very large real world, knowledge base tensor and present our astounding findings which include discovery of potential synonyms among millions of noun-phrases (e.g. the noun 'pollutant' and the noun-phrase 'greenhouse gases').

【Keywords】: big data; distributed computing; hadoop; mapreduce; tensor

44. Active learning for online bayesian matrix factorization.

Paper Link】 【Pages】:325-333

【Authors】: Jorge G. Silva ; Lawrence Carin

【Abstract】: The problem of large-scale online matrix completion is addressed via a Bayesian approach. The proposed method learns a factor analysis (FA) model for large matrices, based on a small number of observed matrix elements, and leverages the statistical model to actively select which new matrix entries/observations would be most informative if they could be acquired, to improve the model; the model inference and active learning are performed in an online setting. In the context of online learning, a greedy, fast and provably near-optimal algorithm is employed to sequentially maximize the mutual information between past and future observations, taking advantage of submodularity properties. Additionally, a simpler procedure, which directly uses the posterior parameters learned by the Bayesian approach, is shown to achieve slightly lower estimation quality, with far less computational effort. Inference is performed using a computationally efficient online variational Bayes (VB) procedure. Competitive results are obtained in a very large collaborative filtering problem, namely the Yahoo! Music ratings dataset.

【Keywords】: matrix factorization; online learning

Research session b4: unsupervised learning 4

45. A sparsity-inducing formulation for evolutionary co-clustering.

Paper Link】 【Pages】:334-342

【Authors】: Shuiwang Ji ; Wenlu Zhang ; Jun Liu

【Abstract】: Traditional co-clustering methods identify block structures from static data matrices. However, the data matrices in many applications are dynamic; that is, they evolve smoothly over time. Consequently, the hidden block structures embedded into the matrices are also expected to vary smoothly along the temporal dimension. It is therefore desirable to encourage smoothness between the block structures identified from temporally adjacent data matrices. In this paper, we propose an evolutionary co-clustering formulation for identifying co-cluster structures from time-varying data. The proposed formulation encourages smoothness between temporally adjacent blocks by employing the fused Lasso type of regularization. Our formulation is very flexible and allows for imposing smoothness constraints over only one dimension of the data matrices, thereby enabling its applicability to a large variety of settings. The optimization problem for the proposed formulation is non-convex, non-smooth, and non-separable. We develop an iterative procedure to compute the solution. Each step of the iterative procedure involves a convex, but non-smooth and non-separable problem. We propose to solve this problem in its dual form, which is convex and smooth. This leads to a simple gradient descent algorithm for computing the dual optimal solution. We evaluate the proposed formulation using the Allen Developing Mouse Brain Atlas data. Results show that our formulation consistently outperforms methods without the temporal smoothness constraints.

【Keywords】: bioinformatics; evolutionary co-clustering; neuroscience; optimization; sparsity learning

46. Detecting changes of clustering structures using normalized maximum likelihood coding.

Paper Link】 【Pages】:343-351

【Authors】: So Hirai ; Kenji Yamanishi

【Abstract】: We are concerned with the issue of detecting changes of clustering structures from multivariate time series. From the viewpoint of the minimum description length(MDL) principle, we propose an algorithm that tracks changes of clustering structures so that the sum of the code-length for data and that for clustering changes is minimum. Here we employ a Gaussian mixture model(GMM) as representation of clustering, and compute the code-length for data sequences using the normalized maximum likelihood (NML) coding. The proposed algorithm enables us to deal with clustering dynamics including merging, splitting, emergence, disappearance of clusters from a unifying view of the MDL principle. We empirically demonstrate using artificial data sets that our proposed method is able to detect cluster changes significantly more accurately than an existing statistical-test based method and AIC/BIC-based methods. We further use real customers' transaction data sets to demonstrate the validity of our algorithm in market analysis. We show that it is able to detect changes of customer groups, which correspond to changes of real market environments.

【Keywords】: clustering; dynamic model selection; minimum description length principle; normalized maximum likelihood

47. Subspace correlation clustering: finding locally correlated dimensions in subspace projections of the data.

Paper Link】 【Pages】:352-360

【Authors】: Stephan Günnemann ; Ines Färber ; Kittipat Virochsiri ; Thomas Seidl

【Abstract】: The necessity to analyze subspace projections of complex data is a well-known fact in the clustering community. While the full space may be obfuscated by overlapping patterns and irrelevant dimensions, only certain subspaces are able to reveal the clustering structure. Subspace clustering discards irrelevant dimensions and allows objects to belong to multiple, overlapping clusters due to individual subspace projections for each set of objects. As we will demonstrate, the observations, which originate the need to consider subspace projections for traditional clustering, also apply for the task of correlation analysis. In this work, we introduce the novel paradigm of subspace correlation clustering: we analyze subspace projections to find subsets of objects showing linear correlations among this subset of dimensions. In contrast to existing techniques, which determine correlations based on the full-space, our method is able to exclude locally irrelevant dimensions, enabling more precise detection of the correlated features. Since we analyze subspace projections, each object can contribute to several correlations. Our model allows multiple overlapping clusters in general but simultaneously avoids redundant clusters deducible from already known correlations. We introduce the algorithm SSCC that exploits different pruning techniques to efficiently generate a subspace correlation clustering. In thorough experiments we demonstrate the strength of our novel paradigm in comparison to existing methods.

【Keywords】: linear correlations; overlapping clusters

48. Dependency clustering across measurement scales.

Paper Link】 【Pages】:361-369

【Authors】: Claudia Plant

【Abstract】: How to automatically spot the major trends in large amounts of heterogeneous data? Clustering can help. However, most existing techniques suffer from one or more of the following drawbacks: 1) Many techniques support only one particular data type, most commonly numerical attributes. 2) Other techniques do not support attribute dependencies which are prevalent in real data. 3) Some approaches require input parameters which are difficult to estimate. 4) Most clustering approaches lack in interpretability. To address these challenges, we present the algorithm Scenic for dependency clustering across measurement scales. Our approach seamlessly integrates heterogenous data types measured at different scales, most importantly continuous numerical and discrete categorical data. Scenic clusters by arranging objects and attributes in a cluster-specific low-dimensional space. The embedding serves as a compact cluster model allowing to reconstruct the original heterogenous attributes with high accuracy. Thereby embedding reveals the major cluster-specific mixed-type attribute dependencies. Following the Minimum Description Length (MDL) principle, the cluster-specific embedding serves as a codebook for effective data compression. This compression-based view automatically balances goodness-of-fit and model complexity, making input parameters redundant. Finally, the embedding serves as a visualization enhancing the interpretability of the clustering result. Extensive experiments demonstrate the benefits of Scenic.

【Keywords】: clustering; heterogeneous data; minimum description length

Industry/govt track b5: social network analysis 4

49. A framework for summarizing and analyzing twitter feeds.

Paper Link】 【Pages】:370-378

【Authors】: Xintian Yang ; Amol Ghoting ; Yiye Ruan ; Srinivasan Parthasarathy

【Abstract】: The firehose of data generated by users on social networking and microblogging sites such as Facebook and Twitter is enormous. Real-time analytics on such data is challenging with most current efforts largely focusing on the efficient querying and retrieval of data produced recently. In this paper, we present a dynamic pattern driven approach to summarize data produced by Twitter feeds. We develop a novel approach to maintain an in-memory summary while retaining sufficient information to facilitate a range of user-specific and topic-specific temporal analytics. We empirically compare our approach with several state-of-the-art pattern summarization approaches along the axes of storage cost, query accuracy, query flexibility, and efficiency using real data from Twitter. We find that the proposed approach is not only scalable but also outperforms existing approaches by a large margin.

【Keywords】: analytics; data summarization; twitter

50. Entity-centric topic-oriented opinion summarization in twitter.

Paper Link】 【Pages】:379-387

【Authors】: Xinfan Meng ; Furu Wei ; Xiaohua Liu ; Ming Zhou ; Sujian Li ; Houfeng Wang

【Abstract】: Microblogging services, such as Twitter, have become popular channels for people to express their opinions towards a broad range of topics. Twitter generates a huge volume of instant messages (i.e. tweets) carrying users' sentiments and attitudes every minute, which both necessitates automatic opinion summarization and poses great challenges to the summarization system. In this paper, we study the problem of opinion summarization for entities, such as celebrities and brands, in Twitter. We propose an entity-centric topic-based opinion summarization framework, which aims to produce opinion summaries in accordance with topics and remarkably emphasizing the insight behind the opinions. To this end, we first mine topics from #hashtags, the human-annotated semantic tags in tweets. We integrate the #hashtags as weakly supervised information into topic modeling algorithms to obtain better interpretation and representation for calculating the similarity among them, and adopt Affinity Propagation algorithm to group #hashtags into coherent topics. Subsequently, we use templates generalized from paraphrasing to identify tweets with deep insights, which reveal reasons, express demands or reflect viewpoints. Afterwards, we develop a target (i.e. entity) dependent sentiment classification approach to identifying the opinion towards a given target (i.e. entity) of tweets. Finally, the opinion summary is generated through integrating information from dimensions of topic, opinion and insight, as well as other factors (e.g. topic relevancy, redundancy and language styles) in an unified optimization framework. We conduct extensive experiments on a real-life data set to evaluate the performance of individual opinion summarization modules as well as the quality of the produced summary. The promising experiment results show the effectiveness of the proposed framework and algorithms.

【Keywords】:

hashtag; Twitter; opinion summarization; sentiment analysis; topic analysis

51. Community discovery and profiling with social messages.

Paper Link】 【Pages】:388-396

【Authors】: Wenjun Zhou ; Hongxia Jin ; Yan Liu

【Abstract】: Discovering communities from social media and collaboration systems has been of great interest in recent years. Existing work show prospects of modeling contents and social links, aiming at discovering social communities, whose definition varies by application. We believe that a community depends not only on the group of people who actively participate, but also the topics they communicate about or collaborate on. This is especially true for workplace email communications. Within an organization, it is not uncommon that employees multifunction, and groups of employees collaborate on multiple projects at the same time. In this paper, we aim to automatically discovering and profiling users' communities by taking into account both the contacts and the topics. More specifically, we propose a community profiling model called COCOMP, where the communities labels are latent, and each social document corresponds to an information sharing activity among the most probable community members regarding the most relevant community issues. Experiment results on several social communication datasets, including emails and Twitter messages, demonstrate that the model can discover users' communities effectively, and provide concrete semantics.

【Keywords】: collaboration; community discovery; email; generative models; social media

Paper Link】 【Pages】:397-405

【Authors】: Ziad Al Bawab ; George H. Mills ; Jean-François Crespo

【Abstract】: In this paper, we present our approach for geographic personalization of a content recommendation system. More specifically, our work focuses on recommending query topics to users. We do this by mining the search query logs to detect trending local topics. For a set of queries we compute their counts and what we call buzz scores, which is a metric for detecting trending behavior. We also compute the entropy of the geographic distribution of the queries as means of detecting their location affinity. We cluster the queries into trending topics and assign the topics to their corresponding location. Human editors then select a subset of these local topics and enter them into a recommendation system. In turn the recommendation system optimizes a pool of trending local and global topics by exploiting user feedback. We present some editorial evaluation of the technique and results of a live experiment. Inclusion of local topics in selected locations into the global pool of topics resulted in more than 6% relative increase in user engagement with the recommendation system compared to using the global topics exclusively.

【Keywords】: clustering; content recommendation; location affinity; personalization; trending search queries

Industrial practice expo b6: session 1 2

53. China's national personal credit scoring system: a real-life intelligent knowledge application.

Paper Link】 【Pages】:406

【Authors】: Yong Shi

【Abstract】: Credit Reference Centre (CRC) of People's Bank of China (PBC) has built a big data: the largest personal credit database in the world with 800 million people's accounts collected from all commercial banks in China since 2003. From June 2006 to Sept 2009, Research Centre on Fictitious Economy and Data Science, Chinese Academy of Sciences (CASFEDS) and CRC jointly developed China's National Personal Credit Scoring System, known as "China Score", which is a unique and advanced KDD application under intelligent knowledge management on this big data. The system will be eventually serving all 1.3 billion population of China for their daily financial activities, such as bank accounts, credit card application, mortgage, personal loans, etc. It can become one of the most influential events of KDD techniques to human kind. This talk will introduce the key components of China Score project that includes objectives, modeling process, KDD techniques used in the projects, intelligent knowledge management and experience of the project development. In addition, the talk will also outline a number of policy recommendations based on China Score project which has been potentially impacting Chinese Government on its strategic decision making for China's economic developments.

【Keywords】: credit scoring; knowledge management; risk models

54. Maximizing return and minimizing cost with the right decision management systems.

Paper Link】 【Pages】:407

【Authors】: Rich Holada

【Abstract】: The ability to achieve operational efficiency, product leadership, and customer intimacy still eludes many organizations due, in large part, to the chaos of business. Inconsistent prioritization and decision making; poor visibility between systems; processes that are not well controlled; and individual front-line decisions that seem small but, in totality, have a huge impact make it difficult for organizations to link strategy to execution and back. During this presentation, we will demonstrate how automating and optimizing decisions (operational efficiency) with business rules and predictive models enables better data driven results across the enterprise, and how this is implemented at the point of impact (customer intimacy) to transform an organization and support market leadership.

【Keywords】: CRM; ROI; automated decision management; predictive analytics

Research session c1: social and web mining applications 4

55. Efficient and domain-invariant competitor mining.

Paper Link】 【Pages】:408-416

【Authors】: Theodoros Lappas ; George Valkanas ; Dimitrios Gunopulos

【Abstract】: In any competitive business, success is based on the ability to make an item more appealing to customers than the competition. A number of questions arise in the context of this task: how do we formalize and quantify the competitiveness relationship between two items? Who are the true competitors of a given item? What are the features of an item that most affect its competitiveness? Despite the impact and relevance of this problem to many domains, only a limited amount of work has been devoted toward an effective solution. In this paper, we present a formal definition of the competitiveness between two items. We present efficient methods for evaluating competitiveness in large datasets and address the natural problem of finding the top-k competitors of a given item. Our methodology is evaluated against strong baselines via a user study and experiments on multiple datasets from different domains.

【Keywords】: competitor mining; competitors; domain-invariant

56. Discriminative clustering for market segmentation.

Paper Link】 【Pages】:417-425

【Authors】: Peter Haider ; Luca Chiarandini ; Ulf Brefeld

【Abstract】: We study discriminative clustering for market segmentation tasks. The underlying problem setting resembles discriminative clustering, however, existing approaches focus on the prediction of univariate cluster labels. By contrast, market segments encode complex (future) behavior of the individuals which cannot be represented by a single variable. In this paper, we generalize discriminative clustering to structured and complex output variables that can be represented as graphical models. We devise two novel methods to jointly learn the classifier and the clustering using alternating optimization and collapsed inference, respectively. The two approaches jointly learn a discriminative segmentation of the input space and a generative output prediction model for each segment. We evaluate our methods on segmenting user navigation sequences from Yahoo! News. The proposed collapsed algorithm is observed to outperform baseline approaches such as mixture of experts. We showcase exemplary projections of the resulting segments to display the interpretability of the solutions.

【Keywords】: discriminative clustering; market segmentation

57. Interacting viruses in networks: can both survive?

Paper Link】 【Pages】:426-434

【Authors】: Alex Beutel ; B. Aditya Prakash ; Roni Rosenfeld ; Christos Faloutsos

【Abstract】: Suppose we have two competing ideas/products/viruses, that propagate over a social or other network. Suppose that they are strong/virulent enough, so that each, if left alone, could lead to an epidemic. What will happen when both operate on the network? Earlier models assume that there is perfect competition: if a user buys product 'A' (or gets infected with virus 'X'), she will never buy product 'B' (or virus 'Y'). This is not always true: for example, a user could install and use both Firefox and Google Chrome as browsers. Similarly, one type of flu may give partial immunity against some other similar disease. In the case of full competition, it is known that 'winner takes all,' that is the weaker virus/product will become extinct. In the case of no competition, both viruses survive, ignoring each other. What happens in-between these two extremes? We show that there is a phase transition: if the competition is harsher than a critical level, then 'winner takes all;' otherwise, the weaker virus survives. These are the contributions of this paper (a) the problem definition, which is novel even in epidemiology literature (b) the phase-transition result and (c) experiments on real data, illustrating the suitability of our results.

【Keywords】: cascades; co-existence; competition; epidemics

58. Aggregating web offers to determine product prices.

Paper Link】 【Pages】:435-443

【Authors】: Rakesh Agrawal ; Samuel Ieong

【Abstract】: Historical prices are important information that can help consumers decide whether the time is right to buy a product. They provide both a context to the users, and facilitate the use of prediction algorithms for forecasting future prices. To produce a representative price history, one needs to consider all offers for the product. However, matching offers to a product is a challenging problem, and mismatches could lead to glaring errors in price history. We propose a principled approach to filter out erroneous matches based on a probabilistic model of prices. We give an efficient algorithm for performing inference that takes advantage of the structure of the problem. We evaluate our results empirically using merchant offers collected from a search engine, and measure the proximity of the price history generated by our approach to the true price history. Our method outperforms alternatives based on robust statistics both in tracking the true price levels and the true price trends.

【Keywords】: data aggregation; time series analysis

Research session c2: event mining 4

59. Mining event periodicity from incomplete observations.

Paper Link】 【Pages】:444-452

【Authors】: Zhenhui Li ; Jingjing Wang ; Jiawei Han

【Abstract】: Advanced technology in GPS and sensors enables us to track physical events, such as human movements and facility usage. Periodicity analysis from the recorded data is an important data mining task which provides useful insights into the physical events and enables us to report outliers and predict future behaviors. To mine periodicity in an event, we have to face real-world challenges of inherently complicated periodic behaviors and imperfect data collection problem. Specifically, the hidden temporal periodic behaviors could be oscillating and noisy, and the observations of the event could be incomplete. In this paper, we propose a novel probabilistic measure for periodicity and design a practical method to detect periods. Our method has thoroughly considered the uncertainties and noises in periodic behaviors and is provably robust to incomplete observations. Comprehensive experiments on both synthetic and real datasets demonstrate the effectiveness of our method.

【Keywords】: incomplete observations; periodicity

60. Towards heterogeneous temporal clinical event pattern discovery: a convolutional approach.

Paper Link】 【Pages】:453-461

【Authors】: Fei Wang ; Noah Lee ; Jianying Hu ; Jimeng Sun ; Shahram Ebadollahi

【Abstract】: Large collections of electronic clinical records today provide us with a vast source of information on medical practice. However, the utilization of those data for exploratory analysis to support clinical decisions is still limited. Extracting useful patterns from such data is particularly challenging because it is longitudinal, sparse and heterogeneous. In this paper, we propose a Nonnegative Matrix Factorization (NMF) based framework using a convolutional approach for open-ended temporal pattern discovery over large collections of clinical records. We call the method One-Sided Convolutional NMF (OSC-NMF). Our framework can mine common as well as individual shift-invariant temporal patterns from heterogeneous events over different patient groups, and handle sparsity as well as scalability problems well. Furthermore, we use an event matrix based representation that can encode quantitatively all key temporal concepts including order, concurrency and synchronicity. We derive efficient multiplicative update rules for OSC-NMF, and also prove theoretically its convergence. Finally, the experimental results on both synthetic and real world electronic patient data are presented to demonstrate the effectiveness of the proposed method.

【Keywords】: convolution; nmf; pattern discovery

61. The long and the short of it: summarising event sequences with serial episodes.

Paper Link】 【Pages】:462-470

【Authors】: Nikolaj Tatti ; Jilles Vreeken

【Abstract】: An ideal outcome of pattern mining is a small set of informative patterns, containing no redundancy or noise, that identifies the key structure of the data at hand. Standard frequent pattern miners do not achieve this goal, as due to the pattern explosion typically very large numbers of highly redundant patterns are returned. We pursue the ideal for sequential data, by employing a pattern set mining approach - an approach where, instead of ranking patterns individually, we consider results as a whole. Pattern set mining has been successfully applied to transactional data, but has been surprisingly understudied for sequential data. In this paper, we employ the MDL principle to identify the set of sequential patterns that summarises the data best. In particular, we formalise how to encode sequential data using sets of serial episodes, and use the encoded length as a quality score. As search strategy, we propose two approaches: the first algorithm selects a good pattern set from a large candidate set, while the second is a parameter-free any-time algorithm that mines pattern sets directly from the data. Experimentation on synthetic and real data demonstrates we efficiently discover small sets of informative patterns.

【Keywords】: event sequence; pattern mining; pattern set mining; serial episodes

62. Efficient event pattern matching with match windows.

Paper Link】 【Pages】:471-479

【Authors】: Bruno Cadonna ; Johann Gamper ; Michael H. Böhlen

【Abstract】: In event pattern matching a sequence of input events is matched against a complex query pattern that specifies constraints on extent, order, values, and quantification of matching events. In this paper we propose a general pattern matching strategy that consists of a pre-processing step and a pattern matching step. Instead of eagerly matching incoming events, the pre-processing step buffers events in a match window to apply different pruning techniques (filtering, partitioning, and testing for necessary match conditions). In the second step, an event pattern matching algorithm, A, is called only for match windows that satisfy the necessary match conditions. This two-phase strategy with a lazy call of the matching algorithm significantly reduces the number of events that need to be processed by A as well as the number of calls to A. This is important since pattern matching algorithms tend to be expensive in terms of runtime and memory complexity, whereas the pre-processing can be done very efficiently. We conduct extensive experiments using real-world data with pattern matching algorithms for, respectively, automata and join trees. The experimental results confirm the effectiveness of our strategy for both types of pattern matching algorithms.

【Keywords】: event pattern matching; optimization; window

Research session c3: matrix approximation 4

63. Optimal exact least squares rank minimization.

Paper Link】 【Pages】:480-488

【Authors】: Shuo Xiang ; Yunzhang Zhu ; Xiaotong Shen ; Jieping Ye

【Abstract】: In multivariate analysis, rank minimization emerges when a low-rank structure of matrices is desired as well as a small estimation error. Rank minimization is nonconvex and generally NP-hard, imposing one major challenge. In this paper, we consider a nonconvex least squares formulation, which seeks to minimize the least squares loss function with the rank constraint. Computationally, we develop efficient algorithms to compute a global solution as well as an entire regularization solution path. Theoretically, we show that our method reconstructs the oracle estimator exactly from noisy data. As a result, it recovers the true rank optimally against any method and leads to sharper parameter estimation over its counterpart. Finally, the utility of the proposed method is demonstrated by simulations and image reconstruction from noisy background.

【Keywords】: global optimality; nonconvex; rank minimization

64. Large-scale distributed non-negative sparse coding and sparse dictionary learning.

Paper Link】 【Pages】:489-497

【Authors】: Vikas Sindhwani ; Amol Ghoting

【Abstract】: We consider the problem of building compact, unsupervised representations of large, high-dimensional, non-negative data using sparse coding and dictionary learning schemes, with an emphasis on executing the algorithm in a Map-Reduce environment. The proposed algorithms may be seen as parallel optimization procedures for constructing sparse non-negative factorizations of large, sparse matrices. Our approach alternates between a parallel sparse coding phase implemented using greedy or convex (l1) regularized risk minimization procedures, and a sequential dictionary learning phase where we solve a set of l0 optimization problems exactly. These two-fold sparsity constraints lead to better statistical performance on text analysis tasks and at the same time make it possible to implement each iteration in a single Map-Reduce job. We detail our implementations and optimizations that lead to the ability to factor matrices with more than 100 million rows and billions of non-zero entries in just a few hours on a small commodity cluster.

【Keywords】: dictionary learning; mapreduce; nmf; sparse coding

65. Learning binary codes for collaborative filtering.

Paper Link】 【Pages】:498-506

【Authors】: Ke Zhou ; Hongyuan Zha

【Abstract】: This paper tackles the efficiency problem of making recommendations in the context of large user and item spaces. In particular, we address the problem of learning binary codes for collaborative filtering, which enables us to efficiently make recommendations with time complexity that is independent of the total number of items. We propose to construct binary codes for users and items such that the preference of users over items can be accurately preserved by the Hamming distance between their respective binary codes. By using two loss functions measuring the degree of divergence between the training and predicted ratings, we formulate the problem of learning binary codes as a discrete optimization problem. Although this optimization problem is intractable in general, we develop effective relaxations that can be efficiently solved by existing methods. Moreover, we investigate two methods to obtain the binary codes from the relaxed solutions. Evaluations are conducted on three public-domain data sets and the results suggest that our proposed method outperforms several baseline alternatives.

【Keywords】: collaborative filtering; discrete optimization; learning binary codes; recommender systems; relaxed solutions

66. Low rank modeling of signed networks.

Paper Link】 【Pages】:507-515

【Authors】: Cho-Jui Hsieh ; Kai-Yang Chiang ; Inderjit S. Dhillon

【Abstract】: Trust networks, where people leave trust and distrust feedback, are becoming increasingly common. These networks may be regarded as signed graphs, where a positive edge weight captures the degree of trust while a negative edge weight captures the degree of distrust. Analysis of such signed networks has become an increasingly important research topic. One important analysis task is that of sign inference, i.e., infer unknown (or future) trust or distrust relationships given a partially observed signed network. Most state-of-the-art approaches consider the notion of structural balance in signed networks, building inference algorithms based on information about links, triads, and cycles in the network. In this paper, we first show that the notion of weak structural balance in signed networks naturally leads to a global low-rank model for the network. Under such a model, the sign inference problem can be formulated as a low-rank matrix completion problem. We show that we can perfectly recover missing relationships, under certain conditions, using state-of-the-art matrix completion algorithms. We also propose the use of a low-rank matrix factorization approach with generalized loss functions as a practical method for sign inference - this approach yields high accuracy while being scalable to large signed networks, for instance, we show that this analysis can be performed on a synthetic graph with 1.1 million nodes and 120 million edges in 10 minutes. We further show that the low-rank model can be used for other analysis tasks on signed networks, such as user segmentation through signed graph clustering, with theoretical guarantees. Experiments on synthetic as well as real data show that our low rank model substantially improves accuracy of sign inference as well as clustering. As an example, on the largest real dataset available to us (Epinions data with 130K nodes and 840K edges), our matrix factorization approach yields 94.6% accuracy on the sign inference task as compared to 90.8% accuracy using a state-of-the-art cycle-based method - moreover, our method runs in 40 seconds as compared to 10,000 seconds for the cycle-based method.

【Keywords】: low rank model; signed networks; structural balance

Research session c4: supervised learning with multivariate data 4

67. A structural cluster kernel for learning on graphs.

Paper Link】 【Pages】:516-524

【Authors】: Madeleine Seeland ; Andreas Karwath ; Stefan Kramer

【Abstract】: In recent years, graph kernels have received considerable interest within the machine learning and data mining community. Here, we introduce a novel approach enabling kernel methods to utilize additional information hidden in the structural neighborhood of the graphs under consideration. Our novel structural cluster kernel (SCK) incorporates similarities induced by a structural clustering algorithm to improve state-of-the-art graph kernels. The approach taken is based on the idea that graph similarity can not only be described by the similarity between the graphs themselves, but also by the similarity they possess with respect to their structural neighborhood. We applied our novel kernel in a supervised and a semi-supervised setting to regression and classification problems on a number of real-world datasets of molecular graphs. Our results show that the structural cluster similarity information can indeed leverage the prediction performance of the base kernel, particularly when the dataset is structurally sparse and consequently structurally diverse. By additionally taking into account a large number of unlabeled instances the performance of the structural cluster kernel can further be improved.

【Keywords】: cheminformatics; cluster kernels; graph kernels; qsar; structural graph clustering

68. Multi-label hypothesis reuse.

Paper Link】 【Pages】:525-533

【Authors】: Sheng-Jun Huang ; Yang Yu ; Zhi-Hua Zhou

【Abstract】: Multi-label learning arises in many real-world tasks where an object is naturally associated with multiple concepts. It is well-accepted that, in order to achieve a good performance, the relationship among labels should be exploited. Most existing approaches require the label relationship as prior knowledge, or exploit by counting the label co-occurrence. In this paper, we propose the MAHR approach, which is able to automatically discover and exploit label relationship. Our basic idea is that, if two labels are related, the hypothesis generated for one label can be helpful for the other label. MAHR implements the idea as a boosting approach with a hypothesis reuse mechanism. In each boosting round, the base learner for a label is generated by not only learning on its own task but also reusing the hypotheses from other labels, and the amount of reuse across labels provides an estimate of the label relationship. Extensive experimental results validate that MAHR is able to achieve superior performance and discover reasonable label relationship. Moreover, we disclose that the label relationship is usually asymmetric.

【Keywords】: hypothesis reuse; label relationship; multi-label learning

69. Rank-loss support instance machines for MIML instance annotation.

Paper Link】 【Pages】:534-542

【Authors】: Forrest Briggs ; Xiaoli Z. Fern ; Raviv Raich

【Abstract】: Multi-instance multi-label learning (MIML) is a framework for supervised classification where the objects to be classified are bags of instances associated with multiple labels. For example, an image can be represented as a bag of segments and associated with a list of objects it contains. Prior work on MIML has focused on predicting label sets for previously unseen bags. We instead consider the problem of predicting instance labels while learning from data labeled only at the bag level. We propose Rank-Loss Support Instance Machines, which optimize a regularized rank-loss objective and can be instantiated with different aggregation models connecting instance-level predictions with bag-level predictions. The aggregation models that we consider are equivalent to defining a "support instance" for each bag, which allows efficient optimization of the rank-loss objective using primal sub-gradient descent. Experiments on artificial and real-world datasets show that the proposed methods achieve higher accuracy than other loss functions used in prior work, e.g., Hamming loss, and recent work in ambiguous label classification.

【Keywords】: bioacoustics; image annotation; instance annotation; multi-instance; multi-label; sub-gradient; support vector machine

70. Inductive multi-task learning with multiple view data.

Paper Link】 【Pages】:543-551

【Authors】: Jintao Zhang ; Jun Huan

【Abstract】: In many real-world applications, it is becoming common to have data extracted from multiple diverse sources, known as "multi-view" data. Multi-view learning (MVL) has been widely studied in many applications, but existing MVL methods learn a single task individually. In this paper, we study a new direction of multi-view learning where there are multiple related tasks with multi-view data (i.e. multi-view multi-task learning, or MVMT Learning). In our MVMT learning methods, we learn a linear mapping for each view in each task. In a single task, we use co-regularization to obtain functions that are in-agreement with each other on the unlabeled samples and achieve low classification errors on the labeled samples simultaneously. Cross different tasks, additional regularization functions are utilized to ensure the functions that we learn in each view are similar. We also developed two extensions of the MVMT learning algorithm. One extension handles missing views and the other handles non-uniformly related tasks. Experimental studies on three real-world data sets demonstrate that our MVMT methods significantly outperform the existing state-of-the-art methods.

【Keywords】: co-regularization; inductive multi-task learning; missing view; multi-view learning; task relationship learning

Industry/govt track c5: web applications 4

71. Scalable misbehavior detection in online video chat services.

Paper Link】 【Pages】:552-560

【Authors】: Xinyu Xing ; Yu-Li Liang ; Sui Huang ; Hanqiang Cheng ; Richard Han ; Qin Lv ; Xue Liu ; Shivakant Mishra ; Yi Zhu

【Abstract】: The need for highly scalable and accurate detection and filtering of misbehaving users and obscene content in online video chat services has grown as the popularity of these services has exploded in popularity. This is a challenging problem because processing large amounts of video is compute intensive, decisions about whether a user is misbehaving or not must be made online and quickly, and moreover these video chats are characterized by low quality video, poorly lit scenes, diversity of users and their behaviors, diversity of the content, and typically short sessions. This paper presents EMeralD, a highly scalable system for accurately detecting and filtering misbehaving users in online video chat applications. EMeralD substantially improves upon the state-of-the-art filtering mechanisms by achieving much lower computational cost and higher accuracy. We demonstrate EMeralD's improvement via experimental evaluations on real-world data sets obtained from Chatroulette.com.

【Keywords】: misbehavior detection; online video chat; video safety

72. Keyword-propagation-based information enriching and noise removal for web news videos.

Paper Link】 【Pages】:561-569

【Authors】: Jun Zhang ; Xiaoming Fan ; Jianyong Wang ; Lizhu Zhou

【Abstract】: The volume of Web videos have increased sharply through the past several years because of the evolvement of Web video sites.Enhanced algorithms on retrieval, classification and TDT (abbreviation of Topic Detection and Tracking) can bring lots of convenience to Web users as well as release tedious work from the administrators. Nevertheless, due to the the insufficiency of annotation keywords and the gap between video features and semantic concepts, it is still far away from satisfactory to implement them based on initial keywords and visual features. In this paper we utilize a keyword propagation algorithm based on manifold structure to enrich the keyword information and remove the noise for videos. Both text similarity and temporal similarity are employed to explore the relationship between any pair of videos and to construct the propagation model. We explore three applications, i.e., TDT, Retrieval and Classification based on a Web news video dataset obtained from a famous online video-distributing website, YouKu, and evaluate our approach. Experimental results demonstrate that they achieve satisfactory performance and always outperform the baseline methods.

【Keywords】: information enrichment; keyword propagation; noise removal

73. Harnessing the wisdom of the crowds for accurate web page clipping.

Paper Link】 【Pages】:570-578

【Authors】: Lei Zhang ; Linpeng Tang ; Ping Luo ; Enhong Chen ; Limei Jiao ; Min Wang ; Guiquan Liu

【Abstract】: Clipping Web pages, namely extracting the informative clips (areas) from Web pages, has many applications, such as Web printing and e-reading on small handheld devices. Although many existing methods attempt to address this task, most of them can either work only on certain types of Web pages (e.g., news- and blog-like web pages), or perform semi-automatically where extra user efforts are required in adjusting the outputs. The problem of clipping any types of Web pages accurately in a totally automatic way remains pretty much open. To this end in this study we harness the wisdom of the crowds to provide accurate recommendation of informative clips on any given Web pages. Specifically, we leverage the knowledge on how previous users clip similar Web pages, and this knowledge repository can be represented as a transaction database where each transaction contains the clips selected by a user on a certain Web page. Then, we formulate a new pattern mining problem, mining top-1 qualified pattern, on transaction database for this recommendation. Here, the recommendation considers not only the pattern support but also the pattern occupancy (proposed in this work). High support requires that patterns appear frequently in the database, while high occupancy requires that patterns occupy a large portion of the transactions they appear in. Thus, it leads to both precise and complete recommendation. Additionally, we explore the properties on occupancy to further prune the search space for high-efficient pattern mining. Finally, we show the effectiveness of the proposed algorithm on a human-labeled ground truth dataset consisting of 2000 web pages from 100 major Web sites, and demonstrate its efficiency on large synthetic datasets.

【Keywords】: frequent and dominant pattern; occupancy; web page clipping; wisdom of the crowds

74. Bootstrapped language identification for multi-site internet domains.

Paper Link】 【Pages】:579-585

【Authors】: Uwe F. Mayer

【Abstract】: We present an algorithm for language identification, in particular of short documents, for the case of an Internet domain with sites in multiple countries with differing languages. The algorithm is significantly faster than standard language identification methods, while providing state-of-the-art identification. We bootstrap the algorithm based on the language identification based on the site alone, a methodology suitable for any supervised language identification algorithm. We demonstrate the bootstrapping and algorithm on eBay email data and on Twitter status updates data. The algorithm is deployed at eBay as part of the back-office development data repository.

【Keywords】: boosting; language identification; large data; statistical model

Industrial practice expo c6: session 2 2

Paper Link】 【Pages】:586

【Authors】: Wei-Ying Ma

【Abstract】: In history, the Moore's law effect has been used to describe phenomena of exponential improvement in technology when it has a virtuous cycle that makes technology improvement proportional to technology itself. For example, chip performance had doubled every 18-24 months because better processors support the development of better layout tools that support the development of even better processors. I will describe a new Moore's law effect that is being created in knowledge engineering and is driven by the self-reinforcing nature of three trends and technical advancements: big data, machine learning, and Internet economics. I will explain how we can take advantage of this new effect to develop a new generation of semantic and knowledge-based search engines. Specifically, my presentation will cover the following three areas: Knowledge acquisition - our goal is to build a comprehensive entity graph and knowledge graph to complement the web and social graphs. I will introduce techniques for entity extraction and knowledgebase construction through automatic and interactive mining and crowdsourcing. Managing knowledge - our goal is to support advanced analytical queries by combining probabilistic knowledge with a distributed platform. I will focus on technology for both online knowledge serving and offline knowledge inference. Knowledge-empowered search and applications - the knowledge we have acquired and curated enables applications like query understanding, entity-centric search experiences, and answers to natural language queries.

【Keywords】: knowledge acquisition; query understanding; semantic search

76. Key lessons learned building recommender systems for large-scale social networks.

Paper Link】 【Pages】:587

【Authors】: Christian Posse

【Abstract】: By helping members to connect, discover and share relevant content or find a new career opportunity, recommender systems have become a critical component of user growth and engagement for social networks. The multidimensional nature of engagement and diversity of members on large-scale social networks have generated new infrastructure and modeling challenges and opportunities in the development, deployment and operation of recommender systems. This presentation will address some of these issues, focusing on the modeling side for which new research is much needed while describing a recommendation platform that enables real-time recommendation updates at scale as well as batch computations, and cross-leverage between different product recommendations. Topics covered on the modeling side will include optimizing for multiple competing objectives, solving contradicting business goals, modeling user intent and interest to maximize placement and timeliness of the recommendations, utility metrics beyond CTR that leverage both real-time tracking of explicit and implicit user feedback, gathering training data for new product recommendations, virality preserving online testing and virtual profiling.

【Keywords】: multi-objective optimization; online testing; real-time updates; recommender systems; user intent modeling

Research session a1: community mining 5

77. Magnet community identification on social networks.

Paper Link】 【Pages】:588-596

【Authors】: Guan Wang ; Yuchen Zhao ; Xiaoxiao Shi ; Philip S. Yu

【Abstract】: Social communities connect people of similar interests together and play essential roles in social network applications. Examples of such communities include people who like the same objects on Facebook, follow common subjects on Twitter, or join similar groups on LinkedIn. Among communities, we notice that some of them are {\em magnetic} to people. A {\em magnet community} is such a community that attracts significantly more people's interests and attentions than other communities of similar topics. With the explosive number of self-formed communities in social networks, one important demand is to identify magnet communities for users. This can not only track attractive communities, but also help improve user experiences and increase their engagements, e.g., the login frequencies and user-generated-content qualities. In this paper, we initiate the study of magnet community identification problem. First we observe several properties of magnet communities, such as attention flow, attention qualify, and attention persistence. Second, we formalize these properties with the combination of community feature extraction into a graph ranking formulation based on constraint quadratic programming. In details, we treat communities of a network as super nodes, and their interactions as links among those super nodes. Therefore, a network of communities is defined. We extract community's magnet features from heterogeneous sources, i.e., a community's standalone features and its dependency features with other communities. A graph ranking model is formulated given these features. Furthermore, we define constraints reflecting communities' magnet properties to regularize the model. We demonstrate the effectiveness of our framework on real world social network data.

【Keywords】: magnet community; social networks

78. Vertex neighborhoods, low conductance cuts, and good seeds for local community methods.

Paper Link】 【Pages】:597-605

【Authors】: David F. Gleich ; C. Seshadhri

【Abstract】: The communities of a social network are sets of vertices with more connections inside the set than outside. We theoretically demonstrate that two commonly observed properties of social networks, heavy-tailed degree distributions and large clustering coefficients, imply the existence of vertex neighborhoods (also known as egonets) that are themselves good communities. We evaluate these neighborhood communities on a range of graphs. What we find is that the neighborhood communities can exhibit conductance scores that are as good as the Fiedler cut. Also, the conductance of neighborhood communities shows similar behavior as the network community profile computed with a personalized PageRank community detection method. Neighborhood communities give us a simple and powerful heuristic for speeding up local partitioning methods. Since finding good seeds for the PageRank clustering method is difficult, most approaches involve an expensive sweep over a great many starting vertices. We show how to use neighborhood communities to quickly generate a small set of seeds.

【Keywords】: clustering coefficients; conductance; egonets; triangles

79. Overlapping community detection via bounded nonnegative matrix tri-factorization.

Paper Link】 【Pages】:606-614

【Authors】: Yu Zhang ; Dit-Yan Yeung

【Abstract】: Complex networks are ubiquitous in our daily life, with the World Wide Web, social networks, and academic citation networks being some of the common examples. It is well understood that modeling and understanding the network structure is of crucial importance to revealing the network functions. One important problem, known as community detection, is to detect and extract the community structure of networks. More recently, the focus in this research topic has been switched to the detection of overlapping communities. In this paper, based on the matrix factorization approach, we propose a method called bounded nonnegative matrix tri-factorization (BNMTF). Using three factors in the factorization, we can explicitly model and learn the community membership of each node as well as the interaction among communities. Based on a unified formulation for both directed and undirected networks, the optimization problem underlying BNMTF can use either the squared loss or the generalized KL-divergence as its loss function. In addition, to address the sparsity problem as a result of missing edges, we also propose another setting in which the loss function is defined only on the observed edges. We report some experiments on real-world datasets to demonstrate the superiority of BNMTF over other related matrix factorization methods.

【Keywords】: bnmtf; community detection; network analysis; nmf

80. DEMON: a local-first discovery method for overlapping communities.

Paper Link】 【Pages】:615-623

【Authors】: Michele Coscia ; Giulio Rossetti ; Fosca Giannotti ; Dino Pedreschi

【Abstract】: Community discovery in complex networks is an interesting problem with a number of applications, especially in the knowledge extraction task in social and information networks. However, many large networks often lack a particular community organization at a global level. In these cases, traditional graph partitioning algorithms fail to let the latent knowledge embedded in modular structure emerge, because they impose a top-down global view of a network. We propose here a simple local-first approach to community discovery, able to unveil the modular organization of real complex networks. This is achieved by democratically letting each node vote for the communities it sees surrounding it in its limited view of the global system, i.e. its ego neighborhood, using a label propagation algorithm; finally, the local communities are merged into a global collection. We tested this intuition against the state-of-the-art overlapping and non-overlapping community discovery methods, and found that our new method clearly outperforms the others in the quality of the obtained communities, evaluated by using the extracted communities to predict the metadata about the nodes of several real world networks. We also show how our method is deterministic, fully incremental, and has a limited time complexity, so that it can be used on web-scale real networks.

【Keywords】: community discovery; complex networks; data mining

81. On the separability of structural classes of communities.

Paper Link】 【Pages】:624-632

【Authors】: Bruno D. Abrahao ; Sucheta Soundarajan ; John E. Hopcroft ; Robert Kleinberg

【Abstract】: Three major factors govern the intricacies of community extraction in networks: (1) the application domain includes a wide variety of networks of fundamentally different natures, (2) the literature offers a multitude of disparate community detection algorithms, and (3) there is no consensus characterizing how to discriminate communities from non-communities. In this paper, we present a comprehensive analysis of community properties through a class separability framework. Our approach enables the assessement of the structural dissimilarity among the output of multiple community detection algorithms and between the output of algorithms and communities that arise in practice. To demostrate this concept, we furnish our method with a large set of structural properties and multiple community detection algorithms. Applied to a diverse collection of large scale network datasets, the analysis reveals that (1) the different detection algorithms extract fundamentally different structures; (2) the structure of communities that arise in practice is closest to that of communities that random-walk-based algorithms extract, although still siginificantly different from that of the output of all the algorithms; and (3) a small subset of the properties are nearly as discriminative as the full set, while making explicit the ways in which the algorithms produce biases. Our framework enables an informed choice of the most suitable community detection method for a given purpose and network and allows for a comparison of existing community detection algorithms while guiding the design of new ones.

【Keywords】: class separability; community structure; detection algorithms; networks

Research session a2: sequential and spatio-temporal patterns 5

82. Discovering lag intervals for temporal dependencies.

Paper Link】 【Pages】:633-641

【Authors】: Liang Tang ; Tao Li ; Larisa Shwartz

【Abstract】: Time lag is a key feature of hidden temporal dependencies within sequential data. In many real-world applications, time lag plays an essential role in interpreting the cause of discovered temporal dependencies. Traditional temporal mining methods either use a predefined time window to analyze the item sequence, or employ statistical techniques to simply derive the time dependencies among items. Such paradigms cannot effectively handle varied data with special properties, e.g., the interleaved temporal dependencies. In this paper, we study the problem of finding lag intervals for temporal dependency analysis. We first investigate the correlations between the temporal dependencies and other temporal patterns, and then propose a generalized framework to resolve the problem. By utilizing the sorted table in representing time lags among items, the proposed algorithm achieves an elegant balance between the time cost and the space cost. Extensive empirical evaluation on both synthetic and real data sets demonstrates the efficiency and effectiveness of our proposed algorithm in finding the temporal dependencies with lag intervals in sequential data.

【Keywords】: temporal dependency; time lag

83. Testing the significance of spatio-temporal teleconnection patterns.

Paper Link】 【Pages】:642-650

【Authors】: Jaya Kawale ; Snigdhansu Chatterjee ; Dominick Ormsby ; Karsten Steinhaeuser ; Stefan Liess ; Vipin Kumar

【Abstract】: Dipoles represent long distance connections between the pressure anomalies of two distant regions that are negatively correlated with each other. Such dipoles have proven important for understanding and explaining the variability in climate in many regions of the world, e.g., the El Nino climate phenomenon is known to be responsible for precipitation and temperature anomalies over large parts of the world. Systematic approaches for dipole detection generate a large number of candidate dipoles, but there exists no method to evaluate the significance of the candidate teleconnections. In this paper, we present a novel method for testing the statistical significance of the class of spatio-temporal teleconnection patterns called as dipoles. One of the most important challenges in addressing significance testing in a spatio-temporal context is how to address the spatial and temporal dependencies that show up as high autocorrelation. We present a novel approach that uses the wild bootstrap to capture the spatio-temporal dependencies, in the special use case of teleconnections in climate data. Our approach to find the statistical significance takes into account the autocorrelation, the seasonality and the trend in the time series over a period of time. This framework is applicable to other problems in spatio-temporal data mining to assess the significance of the patterns.

【Keywords】: significance testing

84. SeqiBloc: mining multi-time spanning blockmodels in dynamic graphs.

Paper Link】 【Pages】:651-659

【Authors】: Jeffrey Chan ; Wei Liu ; Christopher Leckie ; James Bailey ; Kotagiri Ramamohanarao

【Abstract】: Blockmodelling is an important technique for decomposing graphs into sets of roles. Vertices playing the same role have similar patterns of interactions with vertices in other roles. These roles, along with the role to role interactions, can succinctly summarise the underlying structure of the studied graphs. As the underlying graphs evolve with time, it is important to study how their blockmodels evolve too. This will enable us to detect role changes across time, detect different patterns of interactions, for example, weekday and weekend behaviour, and allow us to study how the structure in the underlying dynamic graph evolves. To date, there has been limited research on studying dynamic blockmodels. They focus on smoothing role changes between adjacent time instances. However, this approach can overfit during stationary periods where the underling structure does not change but there is random noise in the graph. Therefore, an approach to a) find blockmodels across spans of time and b) to find the stationary periods is needed. In this paper, we propose an information theoretic framework, SeqiBloc, combined with a change point detection approach to achieve a) and b). In addition, we propose new vertex equivalence definitions that include time, and show how they relate back to our information theoretic approach. We demonstrate their usefulness and superior accuracy over existing work on synthetic and real datasets.

【Keywords】: blockmodel; dynamic graphs; minimum description length; structural equivalence

85. USpan: an efficient algorithm for mining high utility sequential patterns.

Paper Link】 【Pages】:660-668

【Authors】: Junfu Yin ; Zhigang Zheng ; Longbing Cao

【Abstract】: Sequential pattern mining plays an important role in many applications, such as bioinformatics and consumer behavior analysis. However, the classic frequency-based framework often leads to many patterns being identified, most of which are not informative enough for business decision-making. In frequent pattern mining, a recent effort has been to incorporate utility into the pattern selection framework, so that high utility (frequent or infrequent) patterns are mined which address typical business concerns such as dollar value associated with each pattern. In this paper, we incorporate utility into sequential pattern mining, and a generic framework for high utility sequence mining is defined. An efficient algorithm, USpan, is presented to mine for high utility sequential patterns. In USpan, we introduce the lexicographic quantitative sequence tree to extract the complete set of high utility sequences and design concatenation mechanisms for calculating the utility of a node and its children with two effective pruning strategies. Substantial experiments on both synthetic and real datasets show that USpan efficiently identifies high utility sequences from large scale data with very low minimum utility.

【Keywords】: high utility sequential pattern mining; sequential pattern mining

86. Mining large-scale, sparse GPS traces for map inference: comparison of approaches.

Paper Link】 【Pages】:669-677

【Authors】: Xuemei Liu ; James Biagioni ; Jakob Eriksson ; Yin Wang ; George Forman ; Yanmin Zhu

【Abstract】: We address the problem of inferring road maps from large-scale GPS traces that have relatively low resolution and sampling frequency. Unlike past published work that requires high-resolution traces with dense sampling, we focus on situations with coarse granularity data, such as that obtained from thousands of taxis in Shanghai, which transmit their location as seldom as once per minute. Such data sources can be made available inexpensively as byproducts of existing processes, rather than having to drive every road with high-quality GPS instrumentation just for map building - and having to re-drive roads for periodic updates. Although the challenges in using opportunistic probe data are significant, successful mining algorithms could potentially enable the creation of continuously updated maps at very low cost. In this paper, we compare representative algorithms from two approaches: working with individual reported locations vs. segments between consecutive locations. We assess their trade-offs and effectiveness in both qualitative and quantitative comparisons for regions of Shanghai and Chicago.

【Keywords】: gps; map inference; road maps; spatial data mining

Research session a3: personalization and recommendation 5

87. Transparent user models for personalization.

Paper Link】 【Pages】:678-686

【Authors】: Khalid El-Arini ; Ulrich Paquet ; Ralf Herbrich ; Jurgen Van Gael ; Blaise Agüera y Arcas

【Abstract】: Personalization is a ubiquitous phenomenon in our daily online experience. While such technology is critical for helping us combat the overload of information we face, in many cases, we may not even realize that our results are being tailored to our personal tastes and preferences. Worse yet, when such a system makes a mistake, we have little recourse to correct it. In this work, we propose a framework for addressing this problem by developing a new user-interpretable feature set upon which to base personalized recommendations. These features, which we call badges, represent fundamental traits of users (e.g., "vegetarian" or "Apple fanboy") inferred by modeling the interplay between a user's behavior and self-reported identity. Specifically, we consider the microblogging site Twitter, where users provide short descriptions of themselves in their profiles, as well as perform actions such as tweeting and retweeting. Our approach is based on the insight that we can define badges using high precision, low recall rules (e.g., "Twitter profile contains the phrase 'Apple fanboy'"), and with enough data, generalize to other users by observing shared behavior. We develop a fully Bayesian, generative model that describes this interaction, while allowing us to avoid the pitfalls associated with having positive-only data. Experiments on real Twitter data demonstrate the effectiveness of our model at capturing rich and interpretable user traits that can be used to provide transparency for personalization.

【Keywords】: Twitter; graphical models; personalization; transparency

88. Estimating entity importance via counting set covers.

Paper Link】 【Pages】:687-695

【Authors】: Aristides Gionis ; Theodoros Lappas ; Evimaria Terzi

【Abstract】: The data-mining literature is rich in problems asking to assess the importance of entities in a given dataset. At a high level, existing work identifies important entities either by ranking or by selection. Ranking methods assign a score to every entity in the population, and then use the assigned scores to create a ranked list. The major shortcoming of such approaches is that they ignore the redundancy between high-ranked entities, which may in fact be very similar or even identical. Therefore, in scenarios where diversity is desirable, such methods perform poorly. Selection methods overcome this drawback by evaluating the importance of a group of entities collectively. To achieve this, they typically adopt a set-cover formulation, which identifies the entities in the minimum set cover as the important ones. However, this dichotomy of entities conceals the fact that, even though an entity may not be in the reported cover, it may still participate in many other optimal or near-optimal solutions. In this paper, we propose a framework that overcomes the above drawbacks by integrating the ranking and selection paradigms. Our approach assigns importance scores to entities based on both the number and the quality of set-cover solutions that they participate. Our algorithmic contribution lies with the design of an efficient algorithm for approximating the number of high-quality set covers that each entity participates. Our methodology applies to a wide range of applications. In a user study and an experimental evaluation on real data, we demonstrate that our framework is efficient and provides useful and intuitive results.

【Keywords】: counting; importance sampling; set cover

89. ComSoc: adaptive transfer of user behaviors over composite social network.

Paper Link】 【Pages】:696-704

【Authors】: ErHeng Zhong ; Wei Fan ; Junwei Wang ; Lei Xiao ; Yong Li

【Abstract】: Accurate prediction of user behaviors is important for many social media applications, including social marketing, personalization and recommendation, etc. A major challenge lies in that, the available behavior data or interactions between users and items in a given social network are usually very limited and sparse (e.g., >= 99.9% empty). Many previous works model user behavior from only historical user logs. We observe that many people are members of several social networks in the same time, such as Facebook, Twitter and Tencent's QQ. Importantly, their behaviors and interests in different networks influence one another. This gives us an opportunity to leverage the knowledge of user behaviors in different networks, in order to alleviate the data sparsity problem, and enhance the predictive performance of user modeling. Combining different networks "simply and naively" does not work well. Instead, we formulate the problem to model multiple networks as "composite network knowledge transfer". We first select the most suitable networks inside a composite social network via a hierarchical Bayesian model, parameterized for individual users, and then build topic models for user behavior prediction using both the relationships in the selected networks and related behavior data. To handle big data, we have implemented the algorithm using Map/Reduce. We demonstrate that the proposed composite network-based user behavior model significantly improve the predictive accuracy over a number of existing approaches on several real world applications, such as a very large social-networking dataset from Tencent Inc.

【Keywords】: social network analysis; transfer learning

90. Online learning to diversify from implicit feedback.

Paper Link】 【Pages】:705-713

【Authors】: Karthik Raman ; Pannaga Shivaswamy ; Thorsten Joachims

【Abstract】: In order to minimize redundancy and optimize coverage of multiple user interests, search engines and recommender systems aim to diversify their set of results. To date, these diversification mechanisms are largely hand-coded or relied on expensive training data provided by experts. To overcome this problem, we propose an online learning model and algorithms for learning diversified recommendations and retrieval functions from implicit feedback. In our model, the learning algorithm presents a ranking to the user at each step, and uses the set of documents from the presented ranking, which the user reads, as feedback. Even for imperfect and noisy feedback, we show that the algorithms admit theoretical guarantees for maximizing any submodular utility measure under approximately rational user behavior. In addition to the theoretical results, we find that the algorithm learns quickly, accurately, and robustly in empirical evaluations on two datasets.

【Keywords】: diversified retrieval; online learning; submodularity

91. Playlist prediction via metric embedding.

Paper Link】 【Pages】:714-722

【Authors】: Shuo Chen ; Joshua L. Moore ; Douglas Turnbull ; Thorsten Joachims

【Abstract】: Digital storage of personal music collections and cloud-based music services (e.g. Pandora, Spotify) have fundamentally changed how music is consumed. In particular, automatically generated playlists have become an important mode of accessing large music collections. The key goal of automated playlist generation is to provide the user with a coherent listening experience. In this paper, we present Latent Markov Embedding (LME), a machine learning algorithm for generating such playlists. In analogy to matrix factorization methods for collaborative filtering, the algorithm does not require songs to be described by features a priori, but it learns a representation from example playlists. We formulate this problem as a regularized maximum-likelihood embedding of Markov chains in Euclidian space, and show how the resulting optimization problem can be solved efficiently. An empirical evaluation shows that the LME is substantially more accurate than adaptations of smoothed n-gram models commonly used in natural language processing.

【Keywords】: music playlists; recommendation; sequences; user modeling

Research session a4: supervised learning with auxilliary information 5

92. Parallel field ranking.

Paper Link】 【Pages】:723-731

【Authors】: Ming Ji ; Binbin Lin ; Xiaofei He ; Deng Cai ; Jiawei Han

【Abstract】: Recently, ranking data with respect to the intrinsic geometric structure (manifold ranking) has received considerable attentions, with encouraging performance in many applications in pattern recognition, information retrieval and recommendation systems. Most of the existing manifold ranking methods focus on learning a ranking function that varies smoothly along the data manifold. However, beyond smoothness, a desirable ranking function should vary monotonically along the geodesics of the data manifold, such that the ranking order along the geodesics is preserved. In this paper, we aim to learn a ranking function that varies linearly and therefore monotonically along the geodesics of the data manifold. Recent theoretical work shows that the gradient field of a linear function on the manifold has to be a parallel vector field. Therefore, we propose a novel ranking algorithm on the data manifolds, called Parallel Field Ranking. Specifically, we try to learn a ranking function and a vector field simultaneously. We require the vector field to be close to the gradient field of the ranking function, and the vector field to be as parallel as possible. Moreover, we require the value of the ranking function at the query point to be the highest, and then decrease linearly along the manifold. Experimental results on both synthetic data and real data demonstrate the effectiveness of our proposed algorithm.

【Keywords】: manifold; ranking; vector field

93. Semi-supervised learning with mixed knowledge information.

Paper Link】 【Pages】:732-740

【Authors】: Fanhua Shang ; Licheng Jiao ; Fei Wang

【Abstract】: Integrating new knowledge sources into various learning tasks to improve their performance has recently become an interesting topic. In this paper we propose a novel semi-supervised learning (SSL) approach, called semi-supervised learning with Mixed Knowledge Information (SSL-MKI) which can simultaneously handle both sparse labeled data and additional pairwise constraints together with unlabeled data. Specifically, we first construct a unified SSL framework to combine the manifold assumption and the pairwise constraints assumption for classification tasks. Then we present a Modified Fixed Point Continuation (MFPC) algorithm with an eigenvalue thresholding (EVT) operator to learn the enhanced kernel matrix. Finally, we develop a two-stage optimization strategy and provide an efficient SSL approach that takes advantage of Laplacian spectral regularization: semi-supervised learning with Enhanced Spectral Kernel (ESK). Experimental results on a variety of synthetic and real-world datasets demonstrate the effectiveness of the proposed ESK approach.

【Keywords】: graph laplacian; kernel learning; nuclear norm regularization; pairwise constraints; semi-supervised learning (ssl)

94. Batch mode active sampling based on marginal probability distribution matching.

Paper Link】 【Pages】:741-749

【Authors】: Rita Chattopadhyay ; Zheng Wang ; Wei Fan ; Ian Davidson ; Sethuraman Panchanathan ; Jieping Ye

【Abstract】: Active Learning is a machine learning and data mining technique that selects the most informative samples for labeling and uses them as training data; it is especially useful when there are large amount of unlabeled data and labeling them is expensive. Recently, batch-mode active learning, where a set of samples are selected concurrently for labeling, based on their collective merit, has attracted a lot of attention. The objective of batch-mode active learning is to select a set of informative samples so that a classifier learned on these samples has good generalization performance on the unlabeled data. Most of the existing batch-mode active learning methodologies try to achieve this by selecting samples based on varied criteria. In this paper we propose a novel criterion which achieves good generalization performance of a classifier by specifically selecting a set of query samples that minimizes the difference in distribution between the labeled and the unlabeled data, after annotation. We explicitly measure this difference based on all candidate subsets of the unlabeled data and select the best subset. The proposed objective is an NP-hard integer programming optimization problem. We provide two optimization techniques to solve this problem. In the first one, the problem is transformed into a convex quadratic programming problem and in the second method the problem is transformed into a linear programming problem. Our empirical studies using publicly available UCI datasets and a biomedical image dataset demonstrate the effectiveness of the proposed approach in comparison with the state-of-the-art batch-mode active learning methods. We also present two extensions of the proposed approach, which incorporate uncertainty of the predicted labels of the unlabeled data and transfer learning in the proposed formulation. Our empirical studies on UCI datasets show that incorporation of uncertainty information improves performance at later iterations while our studies on 20 Newsgroups dataset show that transfer learning improves the performance of the classifier during initial iterations.

【Keywords】: active learning; marginal probability distribution; maximum mean discrepancy

95. SPF-GMKL: generalized multiple kernel learning with a million kernels.

Paper Link】 【Pages】:750-758

【Authors】: Ashesh Jain ; S. V. N. Vishwanathan ; Manik Varma

【Abstract】: Multiple Kernel Learning (MKL) aims to learn the kernel in an SVM from training data. Many MKL formulations have been proposed and some have proved effective in certain applications. Nevertheless, as MKL is a nascent field, many more formulations need to be developed to generalize across domains and meet the challenges of real world applications. However, each MKL formulation typically necessitates the development of a specialized optimization algorithm. The lack of an efficient, general purpose optimizer capable of handling a wide range of formulations presents a significant challenge to those looking to take MKL out of the lab and into the real world. This problem was somewhat alleviated by the development of the Generalized Multiple Kernel Learning (GMKL) formulation which admits fairly general kernel parameterizations and regularizers subject to mild constraints. However, the projected gradient descent GMKL optimizer is inefficient as the computation of the step size and a reasonably accurate objective function value or gradient direction are all expensive. We overcome these limitations by developing a Spectral Projected Gradient (SPG) descent optimizer which: a) takes into account second order information in selecting step sizes; b) employs a non-monotone step size selection criterion requiring fewer function evaluations; c) is robust to gradient noise, and d) can take quick steps when far away from the optimum. We show that our proposed SPG-GMKL optimizer can be an order of magnitude faster than projected gradient descent on even small and medium sized datasets. In some cases, SPG-GMKL can even outperform state-of-the-art specialized optimization algorithms developed for a single MKL formulation. Furthermore, we demonstrate that SPG-GMKL can scale well beyond gradient descent to large problems involving a million kernels or half a million data points. Our code and implementation are available publically.

【Keywords】: multiple kernel learning; spectral projected gradient descent; support vector machines

96. Efficient evaluation of large sequence kernels.

Paper Link】 【Pages】:759-767

【Authors】: Pavel P. Kuksa ; Vladimir Pavlovic

【Abstract】: Classification of sequences drawn from a finite alphabet using a family of string kernels with inexact matching (e.g., spectrum or mismatch) has shown great success in machine learning. However, selection of optimal mismatch kernels for a particular task is severely limited by inability to compute such kernels for long substrings (k-mers) with potentially many mismatches (m). In this work we introduce a new method that allows us to exactly evaluate kernels for large k, m and arbitrary alphabet size. The task can be accomplished by first solving the more tractable problem for small alphabets, and then trivially generalizing to any alphabet using a small linear system of equations. This makes it possible to explore a larger set of kernels with a wide range of kernel parameters, opening a possibility to better model selection and improved performance of the string kernels. To investigate the utility of large (k,m) string kernels, we consider several sequence classification problems, including protein remote homology detection, fold prediction, and music classification. Our results show that increased k-mer lengths with larger substitutions can improve classification performance.

【Keywords】: sequence classification; string kernels

Industry/govt track a5: computational advertising 5

97. Estimating conversion rate in display advertising from past erformance data.

Paper Link】 【Pages】:768-776

【Authors】: Kuang-chih Lee ; Burkay Orten ; Ali Dasdan ; Wentong Li

【Abstract】: In targeted display advertising, the goal is to identify the best opportunities to display a banner ad to an online user who is most likely to take a desired action such as purchasing a product or signing up for a newsletter. Finding the best ad impression, i.e., the opportunity to show an ad to a user, requires the ability to estimate the probability that the user who sees the ad on his or her browser will take an action, i.e., the user will convert. However, conversion probability estimation is a challenging task since there is extreme data sparsity across different data dimensions and the conversion event occurs rarely. In this paper, we present our approach to conversion rate estimation which relies on utilizing past performance observations along user, publisher and advertiser data hierarchies. More specifically, we model the conversion event at different select hierarchical levels with separate binomial distributions and estimate the distribution parameters individually. Then we demonstrate how we can combine these individual estimators using logistic regression to accurately identify conversion events. In our presentation, we also discuss main practical considerations such as data imbalance, missing data, and output probability calibration, which render this estimation problem more difficult but yet need solving for a real-world implementation of the approach. We provide results from real advertising campaigns to demonstrate the effectiveness of our proposed approach.

【Keywords】: action rate estimation; algorithemic advertising; computational advertising; logistic regression

98. Multimedia features for click prediction of new ads in display advertising.

Paper Link】 【Pages】:777-785

【Authors】: Haibin Cheng ; Roelof van Zwol ; Javad Azimi ; Eren Manavoglu ; Ruofei Zhang ; Yang Zhou ; Vidhya Navalpakkam

【Abstract】: Non-guaranteed display advertising (NGD) is a multi-billion dollar business that has been growing rapidly in recent years. Advertisers in NGD sell a large portion of their ad campaigns using performance dependent pricing models such as cost-per-click (CPC) and cost-per-action (CPA). An accurate prediction of the probability that users click on ads is a crucial task in NGD advertising because this value is required to compute the expected revenue. State-of-the-art prediction algorithms rely heavily on historical information collected for advertisers, users and publishers. Click prediction of new ads in the system is a challenging task due to the lack of such historical data. The objective of this paper is to mitigate this problem by integrating multimedia features extracted from display ads into the click prediction models. Multimedia features can help us capture the attractiveness of the ads with similar contents or aesthetics. In this paper we evaluate the use of numerous multimedia features (in addition to commonly used user, advertiser and publisher features) for the purposes of improving click prediction in ads with no history. We provide analytical results generated over billions of samples and demonstrate that adding multimedia features can significantly improve the accuracy of click prediction for new ads, compared to a state-of-the-art baseline model.

【Keywords】: GMM; click prediction; display advertising; flash; image; multimedia features; new ads

99. Trustworthy online controlled experiments: five puzzling outcomes explained.

Paper Link】 【Pages】:786-794

【Authors】: Ron Kohavi ; Alex Deng ; Brian Frasca ; Roger Longbotham ; Toby Walker ; Ya Xu

【Abstract】: Online controlled experiments are often utilized to make data-driven decisions at Amazon, Microsoft, eBay, Facebook, Google, Yahoo, Zynga, and at many other companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher's experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale--thousands of experiments now--has taught us many lessons. These exemplify the proverb that the difference between theory and practice is greater in practice than in theory. We present our learnings as they happened: puzzling outcomes of controlled experiments that we analyzed deeply to understand and explain. Each of these took multiple-person weeks to months to properly analyze and get to the often surprising root cause. The root causes behind these puzzling results are not isolated incidents; these issues generalized to multiple experiments. The heightened awareness should help readers increase the trustworthiness of the results coming out of controlled experiments. At Microsoft's Bing, it is not uncommon to see experiments that impact annual revenue by millions of dollars, thus getting trustworthy results is critical and investing in understanding anomalies has tremendous payoff: reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts. The topics we cover include: the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects.

【Keywords】: a/b testing; controlled experiments; randomized experiments

Paper Link】 【Pages】:795-803

【Authors】: Ye Chen ; Tak W. Yan

【Abstract】: Click-through rate (CTR) prediction plays a central role in search advertising. One needs CTR estimates unbiased by positional effect in order for ad ranking, allocation, and pricing to be based upon ad relevance or quality in terms of click propensity. However, the observed click-through data has been confounded by positional bias, that is, users tend to click more on ads shown in higher positions than lower ones, regardless of the ad relevance. We describe a probabilistic factor model as a general principled approach to studying these exogenous and often overwhelming phenomena. The model is simple and linear in nature, while empirically justified by the advertising domain. Our experimental results with artificial and real-world sponsored search data show the soundness of the underlying model assumption, which in turn yields superior prediction accuracy.

【Keywords】: exogeneity; factor models; search advertising

101. Bid optimizing and inventory scoring in targeted online advertising.

Paper Link】 【Pages】:804-812

【Authors】: Claudia Perlich ; Brian Dalessandro ; Rod Hook ; Ori Stitelman ; Troy Raeder ; Foster J. Provost

【Abstract】: Billions of online display advertising spots are purchased on a daily basis through real time bidding exchanges (RTBs). Advertising companies bid for these spots on behalf of a company or brand in order to purchase these spots to display banner advertisements. These bidding decisions must be made in fractions of a second after the potential purchaser is informed of what location (Internet site) has a spot available and who would see the advertisement. The entire transaction must be completed in near real-time to avoid delays loading the page and maintain a good users experience. This paper presents a bid-optimization approach that is implemented in production at Media6Degrees for bidding on these advertising opportunities at an appropriate price. The approach combines several supervised learning algorithms, as well as second price auction theory, to determine the correct price to ensure that the right message is delivered to the right person, at the right time.

【Keywords】: bid optimization; causality; logistic regression; online advertising; probability estimation

Asia-Pacific track a6: session 2 4

102. Algorithms for mining uncertain graph data.

Paper Link】 【Pages】:813

【Authors】: Jianzhong Li

【Abstract】: With the rapid development of advanced data acquisition techniques such as high-throughput biological experiments and wireless sensor networks, large amount of graph-structured data, graph data for short, have been collected in a wide range of applications. Discovering knowledge from graph data has witnessed a number of applications and received a lot of research attentions. Recently, it is observed that uncertainties are inherent in the structures of some graph data. For example, protein-protein interaction (PPI) data can be represented as a graph, where vertices represent proteins, and edges represent PPI's. Due to the limits of PPI detection methods, it is uncertain that a detected PPI exist in practice. Other examples of uncertain graph data include topologies of wireless sensor networks, social networks and so on. Managing and mining such large-scale uncertain graph data is of both theoretical and practical significance. Many solid works have been conducted on uncertain graph mining from the aspects of models, semantics, methodology and algorithms in last few years. A number of research papers on managing and mining uncertain graph data have been published in the database and data mining conferences such as VLDB, ICDE, KDD, CIKM and EDBT. This talk focuses on the data model, semantics, computational complexity and algorithms of uncertain graph mining. In the talk, some typical research work in the field of uncertain graph mining will also be introduced, including frequent subgraph pattern mining, dense subgraph detection, reliable subgraph discovery, and clustering on uncertain graph data.

【Keywords】: graph mining; uncertain data

103. Experience with discovering knowledge by acquiring it.

Paper Link】 【Pages】:814

【Authors】: Paul Compton

【Abstract】: Machines and people have complementary skills in Knowledge Discovery. Automated techniques can process enormous amounts of data to find new relationships, but generally these are represented by fairly simple models. On the other hand people are endlessly inventive in creating models to explain data at hand, but have problems developing consistent overall models to explain all the data that might occur in a domain; and the larger the model, the more difficult it becomes to maintain consistency. Ripple-Down Rules is a technique that has been developed to allow people to make real-time updates to a model whenever they notice some data that the model does not yet explain, while at the same time maintaining consistency. This allows an entire knowledge base to be built while it is already in use by making updates. There are now 100s of Ripple-Down-Rule knowledge bases in use and this paper presents some observations from log files tracking how people build these systems, and also outlines some recent research on how such techniques can be used to add greater specificity to the simpler models developed by automated techniques.

【Keywords】: knowledge base; rule-based model

104. Bayesian relational data analysis.

Paper Link】 【Pages】:815

【Authors】: Naonori Ueda

【Abstract】: Recently there have been many collections of relational data in diverse areas such as the internet, social networks, customer shopping records, bioinformatics, etc. The main goal of the relational data analysis is to discover latent structure from the data. The conventional data mining algorithms based on exhaustive enumeration have an inherent limitation for this purpose because of the combinatorial nature of the methods. In contrast, in machine learning a lot of statistical models have been proposed for the relational data analysis. In this talk, first I will review the statistical approach, especially Bayesian approach, for the relational data analysis with recent advancements in machine learning literature. Then, as a future research I will also talk about a statistical approach for combining multiple relational data.

【Keywords】: bayesian analysis; relational data

105. Cross-media knowledge discovery.

Paper Link】 【Pages】:816

【Authors】: Zhongzhi Shi

【Abstract】: In this talk I introduce cloud computing based cross-media knowledge discovery. We propose a framework for cross-media semantic understanding which contains discriminative modeling, generative modeling and cognitive modeling. In cognitive modeling a new model entitled CAM is proposed which is suitable for cross-media semantic understanding. We develop an agent-aid model for load balance in cloud computing environment. For quality of service we present a utility function to evaluate the cloud performance. A Cross-Media Intelligent Retrieval System (CMIRS), which is managed by ontology-based knowledge system KMSphere, will be illustrated. Finally, the directions for further researches on cloud computing based cross-media knowledge discovery will be pointed out and discussed.

【Keywords】: cloud computing; cross-media knowledge discovery

Research session b1: review, discussion, and q & a 4

106. Review spam detection via temporal pattern discovery.

Paper Link】 【Pages】:823-831

【Authors】: Sihong Xie ; Guan Wang ; Shuyang Lin ; Philip S. Yu

【Abstract】: Online reviews play a crucial role in today's electronic commerce. It is desirable for a customer to read reviews of products or stores before making the decision of what or from where to buy. Due to the pervasive spam reviews, customers can be misled to buy low-quality products, while decent stores can be defamed by malicious reviews. We observe that, in reality, a great portion (> 90% in the data we study) of the reviewers write only one review (singleton review). These reviews are so enormous in number that they can almost determine a store's rating and impression. However, existing methods did not examine this larger part of the reviews. Are most of these singleton reviews truthful ones? If not, how to detect spam reviews in singleton reviews? We call this problem singleton review spam detection. To address this problem, we observe that the normal reviewers' arrival pattern is stable and uncorrelated to their rating pattern temporally. In contrast, spam attacks are usually bursty and either positively or negatively correlated to the rating. Thus, we propose to detect such attacks via unusually correlated temporal patterns. We identify and construct multidimensional time series based on aggregate statistics, in order to depict and mine such correlations. In this way, the singleton review spam detection problem is mapped to a abnormally correlated pattern detection problem. We propose a hierarchical algorithm to robustly detect the time windows where such attacks are likely to have happened. The algorithm also pinpoints such windows in different time resolutions to facilitate faster human inspection. Experimental results show that the proposed method is effective in detecting singleton review attacks. We discover that singleton review is a significant source of spam reviews and largely affects the ratings of online stores.

【Keywords】: adversarial data mining; review spam; time series

107. Selecting a characteristic set of reviews.

Paper Link】 【Pages】:832-840

【Authors】: Theodoros Lappas ; Mark Crovella ; Evimaria Terzi

【Abstract】: Online reviews provide consumers with valuable information that guides their decisions on a variety of fronts: from entertainment and shopping to medical services. Although the proliferation of online reviews gives insights about different aspects of a product, it can also prove a serious drawback: consumers cannot and will not read thousands of reviews before making a purchase decision. This need to extract useful information from large review corpora has spawned considerable prior work, but so far all have drawbacks. Review summarization (generating statistical descriptions of review sets) sacrifices the immediacy and narrative structure of reviews. Likewise, review selection (identifying a subset of 'helpful' or 'important' reviews) leads to redundant or non-representative summaries. In this paper, we fill the gap between existing review-summarization and review-selection methods by selecting a small subset of reviews that together preserve the statistical properties of the entire review corpus. We formalize this task as a combinatorial optimization problem and show that it NP-hard both tosolve and approximate. We also design effective algorithms that prove to work well in practice. Our experiments with real review corpora on different types of products demonstrate the utility of our methods, and our user studies indicate that our methods provide a better summary than prior approaches.

【Keywords】: customer reviews; review selection; reviews

108. Mining contentions from discussions and debates.

Paper Link】 【Pages】:841-849

【Authors】: Arjun Mukherjee ; Bing Liu

【Abstract】: Social media has become a major source of information for many applications. Numerous techniques have been proposed to analyze network structures and text contents. In this paper, we focus on fine-grained mining of contentions in discussion/debate forums. Contentions are perhaps the most important feature of forums that discuss social, political and religious issues. Our goal is to discover contention and agreement indicator expressions, and contention points or topics both at the discussion collection level and also at each individual post level. To the best of our knowledge, limited work has been done on such detailed analysis. This paper proposes three models to solve the problem, which not only model both contention/agreement expressions and discussion topics, but also, more importantly, model the intrinsic nature of discussions/debates, i.e., interactions among discussants or debaters and topic sharing among posts through quoting and replying relations. Evaluation results using real-life discussion/debate posts from several domains demonstrate the effectiveness of the proposed models.

【Keywords】: contention analysis; debates; discussions; social media

109. Discovering value from community activity on focused question answering sites: a case study of stack overflow.

Paper Link】 【Pages】:850-858

【Authors】: Ashton Anderson ; Daniel P. Huttenlocher ; Jon M. Kleinberg ; Jure Leskovec

【Abstract】: Question answering (Q&A) websites are now large repositories of valuable knowledge. While most Q&A sites were initially aimed at providing useful answers to the question asker, there has been a marked shift towards question answering as a community-driven knowledge creation process whose end product can be of enduring value to a broad audience. As part of this shift, specific expertise and deep knowledge of the subject at hand have become increasingly important, and many Q&A sites employ voting and reputation mechanisms as centerpieces of their design to help users identify the trustworthiness and accuracy of the content. To better understand this shift in focus from one-off answers to a group knowledge-creation process, we consider a question together with its entire set of corresponding answers as our fundamental unit of analysis, in contrast with the focus on individual question-answer pairs that characterized previous work. Our investigation considers the dynamics of the community activity that shapes the set of answers, both how answers and voters arrive over time and how this influences the eventual outcome. For example, we observe significant assortativity in the reputations of co-answerers, relationships between reputation and answer speed, and that the probability of an answer being chosen as the best one strongly depends on temporal characteristics of answer arrivals. We then show that our understanding of such properties is naturally applicable to predicting several important quantities, including the long-term value of the question and its answers, as well as whether a question requires a better answer. Finally, we discuss the implications of these results for the design of Q&A sites.

【Keywords】: question-answering; reputation; value prediction

Research session b2: outlier and intrusion detection 4

110. Integrating community matching and outlier detection for mining evolutionary community outliers.

Paper Link】 【Pages】:859-867

【Authors】: Manish Gupta ; Jing Gao ; Yizhou Sun ; Jiawei Han

【Abstract】: Temporal datasets, in which data evolves continuously, exist in a wide variety of applications, and identifying anomalous or outlying objects from temporal datasets is an important and challenging task. Different from traditional outlier detection, which detects objects that have quite different behavior compared with the other objects, temporal outlier detection tries to identify objects that have different evolutionary behavior compared with other objects. Usually objects form multiple communities, and most of the objects belonging to the same community follow similar patterns of evolution. However, there are some objects which evolve in a very different way relative to other community members, and we define such objects as evolutionary community outliers. This definition represents a novel type of outliers considering both temporal dimension and community patterns. We investigate the problem of identifying evolutionary community outliers given the discovered communities from two snapshots of an evolving dataset. To tackle the challenges of community evolution and outlier detection, we propose an integrated optimization framework which conducts outlier-aware community matching across snapshots and identification of evolutionary outliers in a tightly coupled way. A coordinate descent algorithm is proposed to improve community matching and outlier detection performance iteratively. Experimental results on both synthetic and real datasets show that the proposed approach is highly effective in discovering interesting evolutionary community outliers.

【Keywords】: anomaly detection; community matching; ecoutlier; evolutionary community outliers; temporal outliers

111. Different slopes for different folks: mining for exceptional regression models with cook's distance.

Paper Link】 【Pages】:868-876

【Authors】: Wouter Duivesteijn ; Ad Feelders ; Arno J. Knobbe

【Abstract】: Exceptional Model Mining (EMM) is an exploratory data analysis technique that can be regarded as a generalization of subgroup discovery. In EMM we look for subgroups of the data for which a model fitted to the subgroup differs substantially from the same model fitted to the entire dataset. In this paper we develop methods to mine for exceptional regression models. We propose a measure for the exceptionality of regression models (Cook's distance), and explore the possibilities to avoid having to fit the regression model to each candidate subgroup. The algorithm is evaluated on a number of real life datasets. These datasets are also used to illustrate the results of the algorithm. We find interesting subgroups with deviating models on datasets from several different domains. We also show that under certain circumstances one can forego fitting regression models on up to 40% of the subgroups, and these 40% are the relatively expensive regression models to compute.

【Keywords】: cook's distance; exceptional model mining; linear regression; subgroup discovery

112. A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data.

Paper Link】 【Pages】:877-885

【Authors】: Ninh Pham ; Rasmus Pagh

【Abstract】: Outlier mining in d-dimensional point sets is a fundamental and well studied data mining task due to its variety of applications. Most such applications arise in high-dimensional domains. A bottleneck of existing approaches is that implicit or explicit assessments on concepts of distance or nearest neighbor are deteriorated in high-dimensional data. Following up on the work of Kriegel et al. (KDD '08), we investigate the use of angle-based outlier factor in mining high-dimensional outliers. While their algorithm runs in cubic time (with a quadratic time heuristic), we propose a novel random projection-based technique that is able to estimate the angle-based outlier factor for all data points in time near-linear in the size of the data. Also, our approach is suitable to be performed in parallel environment to achieve a parallel speedup. We introduce a theoretical analysis of the quality of approximation to guarantee the reliability of our estimation algorithm. The empirical experiments on synthetic and real world data sets demonstrate that our approach is efficient and scalable to very large high-dimensional data sets.

【Keywords】: AMS sketch; angle-based; high-dimensional; outlier detection; random projection

113. Intrusion as (anti)social communication: characterization and detection.

Paper Link】 【Pages】:886-894

【Authors】: Qi Ding ; Natallia Katenka ; Paul Barford ; Eric D. Kolaczyk ; Mark Crovella

【Abstract】: A reasonable definition of intrusion is: entering a community to which one does not belong. This suggests that in a network, intrusion attempts may be detected by looking for communication that does not respect community boundaries. In this paper, we examine the utility of this concept for identifying malicious network sources. In particular, our goal is to explore whether this concept allows a core-network operator using flow data to augment signature-based systems located at network edges. We show that simple measures of communities can be defined for flow data that allow a remarkably effective level of intrusion detection simply by looking for flows that do not respect those communities. We validate our approach using labeled intrusion attempt data collected at a large number of edge networks. Our results suggest that community-based methods can offer an important additional dimension for intrusion detection systems.

【Keywords】: intrusion detection; social graphs

Research session b3: feature selection 4

114. Robust multi-task feature learning.

Paper Link】 【Pages】:895-903

【Authors】: Pinghua Gong ; Jieping Ye ; Changshui Zhang

【Abstract】: Multi-task learning (MTL) aims to improve the performance of multiple related tasks by exploiting the intrinsic relationships among them. Recently, multi-task feature learning algorithms have received increasing attention and they have been successfully applied to many applications involving high dimensional data. However, they assume that all tasks share a common set of features, which is too restrictive and may not hold in real-world applications, since outlier tasks often exist. In this paper, we propose a Robust Multi-Task Feature Learning algorithm (rMTFL) which simultaneously captures a common set of features among relevant tasks and identifies outlier tasks. Specifically, we decompose the weight (model) matrix for all tasks into two components. We impose the well-known group Lasso penalty on row groups of the first component for capturing the shared features among relevant tasks. To simultaneously identify the outlier tasks, we impose the same group Lasso penalty but on column groups of the second component. We propose to employ the accelerated gradient descent to efficiently solve the optimization problem in rMTFL, and show that the proposed algorithm is scalable to large-size problems. In addition, we provide a detailed theoretical analysis on the proposed rMTFL formulation. Specifically, we present a theoretical bound to measure how well our proposed rMTFL approximates the true evaluation, and provide bounds to measure the error between the estimated weights of rMTFL and the underlying true weights. Moreover, by assuming that the underlying true weights are above the noise level, we present a sound theoretical result to show how to obtain the underlying true shared features and outlier tasks (sparsity patterns). Empirical studies on both synthetic and real-world data demonstrate that our proposed rMTFL is capable of simultaneously capturing shared features among tasks and identifying outlier tasks.

【Keywords】: feature selection; multi-task learning; outlier tasks detection

115. Unsupervised feature selection for linked social media data.

Paper Link】 【Pages】:904-912

【Authors】: Jiliang Tang ; Huan Liu

【Abstract】: The prevalent use of social media produces mountains of unlabeled, high-dimensional data. Feature selection has been shown effective in dealing with high-dimensional data for efficient data mining. Feature selection for unlabeled data remains a challenging task due to the absence of label information by which the feature relevance can be assessed. The unique characteristics of social media data further complicate the already challenging problem of unsupervised feature selection, (e.g., part of social media data is linked, which makes invalid the independent and identically distributed assumption), bringing about new challenges to traditional unsupervised feature selection algorithms. In this paper, we study the differences between social media data and traditional attribute-value data, investigate if the relations revealed in linked data can be used to help select relevant features, and propose a novel unsupervised feature selection framework, LUFS, for linked social media data. We perform experiments with real-world social media datasets to evaluate the effectiveness of the proposed framework and probe the working of its key components.

【Keywords】: linked social media data; pseudo-class label; unsupervised feature selection

116. Model mining for robust feature selection.

Paper Link】 【Pages】:913-921

【Authors】: Adam Woznica ; Phong Nguyen ; Alexandros Kalousis

【Abstract】: A common problem with most of the feature selection methods is that they often produce feature sets--models--that are not stable with respect to slight variations in the training data. Different authors tried to improve the feature selection stability using ensemble methods which aggregate different feature sets into a single model. However, the existing ensemble feature selection methods suffer from two main shortcomings: (i) the aggregation treats the features independently and does not account for their interactions, and (ii) a single feature set is returned, nevertheless, in various applications there might be more than one feature sets, potentially redundant, with similar information content. In this work we address these two limitations. We present a general framework in which we mine over different feature models produced from a given dataset in order to extract patterns over the models. We use these patterns to derive more complex feature model aggregation strategies that account for feature interactions, and identify core and distinct feature models. We conduct an extensive experimental evaluation of the proposed framework where we demonstrate its effectiveness over a number of high-dimensional problems from the fields of biology and text-mining.

【Keywords】: classification; feature selection; high-dimensional data; model mining; stability

117. Feature grouping and selection over an undirected graph.

Paper Link】 【Pages】:922-930

【Authors】: Sen Yang ; Lei Yuan ; Ying-Cheng Lai ; Xiaotong Shen ; Peter Wonka ; Jieping Ye

【Abstract】: High-dimensional regression/classification continues to be an important and challenging problem, especially when features are highly correlated. Feature selection, combined with additional structure information on the features has been considered to be promising in promoting regression/classification performance. Graph-guided fused lasso (GFlasso) has recently been proposed to facilitate feature selection and graph structure exploitation, when features exhibit certain graph structures. However, the formulation in GFlasso relies on pairwise sample correlations to perform feature grouping, which could introduce additional estimation bias. In this paper, we propose three new feature grouping and selection methods to resolve this issue. The first method employs a convex function to penalize the pairwise l∞ norm of connected regression/classification coefficients, achieving simultaneous feature grouping and selection. The second method improves the first one by utilizing a non-convex function to reduce the estimation bias. The third one is the extension of the second method using a truncated l1 regularization to further reduce the estimation bias. The proposed methods combine feature grouping and feature selection to enhance estimation accuracy. We employ the alternating direction method of multipliers (ADMM) and difference of convex functions (DC) programming to solve the proposed formulations. Our experimental results on synthetic data and two real datasets demonstrate the effectiveness of the proposed methods.

【Keywords】: l 1 regularization; classification; feature grouping; feature selection; regression; undirected graph

Research session b4: nearest neighbors 4

Paper Link】 【Pages】:931-939

【Authors】: Parikshit Ram ; Alexander G. Gray

【Abstract】: The problem of efficiently finding the best match for a query in a given set with respect to the Euclidean distance or the cosine similarity has been extensively studied. However, the closely related problem of efficiently finding the best match with respect to the inner-product has never been explored in the general setting to the best of our knowledge. In this paper we consider this problem and contrast it with the previous problems considered. First, we propose a general branch-and-bound algorithm based on a (single) tree data structure. Subsequently, we present a dual-tree algorithm for the case where there are multiple queries. Our proposed branch-and-bound algorithms are based on novel inner-product bounds. Finally we present a new data structure, the cone tree, for increasing the efficiency of the dual-tree algorithm. We evaluate our proposed algorithms on a variety of data sets from various applications, and exhibit up to five orders of magnitude improvement in query time over the naive search technique in some cases.

【Keywords】: cone trees; dual-tree branch-and-bound; metric trees

119. A probabilistic model for multimodal hash function learning.

Paper Link】 【Pages】:940-948

【Authors】: Yi Zhen ; Dit-Yan Yeung

【Abstract】: In recent years, both hashing-based similarity search and multimodal similarity search have aroused much research interest in the data mining and other communities. While hashing-based similarity search seeks to address the scalability issue, multimodal similarity search deals with applications in which data of multiple modalities are available. In this paper, our goal is to address both issues simultaneously. We propose a probabilistic model, called multimodal latent binary embedding (MLBE), to learn hash functions from multimodal data automatically. MLBE regards the binary latent factors as hash codes in a common Hamming space. Given data from multiple modalities, we devise an efficient algorithm for the learning of binary latent factors which corresponds to hash function learning. Experimental validation of MLBE has been conducted using both synthetic data and two realistic data sets. Experimental results show that MLBE compares favorably with two state-of-the-art models.

【Keywords】: binary latent factor models; hash function learning; metric learning; multimodal similarity search

120. On socio-spatial group query for location-based social networks.

Paper Link】 【Pages】:949-957

【Authors】: De-Nian Yang ; Chih-Ya Shen ; Wang-Chien Lee ; Ming-Syan Chen

【Abstract】: Challenges faced in organizing impromptu activities are the requirements of making timely invitations in accordance with the locations of candidate attendees and the social relationship among them. It is desirable to find a group of attendees close to a rally point and ensure that the selected attendees have a good social relationship to create a good atmosphere in the activity. Therefore, this paper proposes Socio-Spatial Group Query (SSGQ) to select a group of nearby attendees with tight social relation. Efficient processing of SSGQ is very challenging due to the tradeoff in the spatial and social domains. We show that the problem is NP-hard via a proof and design an efficient algorithm SSGSelect, which includes effective pruning techniques to reduce the running time for finding the optimal solution. We also propose a new index structure, Social R-Tree to further improve the efficiency. User study and experimental results demonstrate that SSGSelect significantly outperforms manual coordination in both solution quality and efficiency.

【Keywords】: query processing; social networks; spatial indexing

121. Random forests for metric learning with implicit pairwise position dependence.

Paper Link】 【Pages】:958-966

【Authors】: Caiming Xiong ; David M. Johnson ; Ran Xu ; Jason J. Corso

【Abstract】: Metric learning makes it plausible to learn semantically meaningful distances for complex distributions of data using label or pairwise constraint information. However, to date, most metric learning methods are based on a single Mahalanobis metric, which cannot handle heterogeneous data well. Those that learn multiple metrics throughout the feature space have demonstrated superior accuracy, but at a severe cost to computational efficiency. Here, we adopt a new angle on the metric learning problem and learn a single metric that is able to implicitly adapt its distance function throughout the feature space. This metric adaptation is accomplished by using a random forest-based classifier to underpin the distance function and incorporate both absolute pairwise position and standard relative position into the representation. We have implemented and tested our method against state of the art global and multi-metric methods on a variety of data sets. Overall, the proposed method outperforms both types of method in terms of accuracy (consistently ranked first) and is an order of magnitude faster than state of the art multi-metric methods (16x faster in the worst case).

【Keywords】: metric learning; pairwise constraints; random forests

Industry/govt track b5: business intelligence 4

122. Empowering authors to diagnose comprehension burden in textbooks.

Paper Link】 【Pages】:967-975

【Authors】: Rakesh Agrawal ; Sunandan Chakraborty ; Sreenivas Gollapudi ; Anitha Kannan ; Krishnaram Kenthapadi

【Abstract】: Good textbooks are organized in a systematically progressive fashion so that students acquire new knowledge and learn new concepts based on known items of information. We provide a diagnostic tool for quantitatively assessing the comprehension burden that a textbook imposes on the reader due to non-sequential presentation of concepts. We present a formal definition of comprehension burden and propose an algorithmic approach for computing it. We apply the tool to a corpus of high school textbooks from India and empirically examine its effectiveness in helping authors identify sections of textbooks that can benefit from reorganizing the material presented.

【Keywords】: comprehension burden; concepts; data mining; diagnostic tool for authors; education; knowledge discovery; textbooks

123. Coupled behavior analysis for capturing coupling relationships in group-based market manipulations.

Paper Link】 【Pages】:976-984

【Authors】: Yin Song ; Longbing Cao ; Xindong Wu ; Gang Wei ; Wu Ye ; Wei Ding

【Abstract】: In stock markets, an emerging challenge for surveillance is that a group of hidden manipulators collaborate with each other to manipulate the price movement of securities. Recently, the coupled hidden Markov model (CHMM)-based coupled behavior analysis (CBA) has been proposed to consider the coupling relationships in the above group-based behaviors for manipulation detection. From the modeling perspective, however, this requires overall aggregation of the behavioral data to cater for the CHMM modeling, which does not differentiate the coupling relationships presented in different forms within the aggregated behaviors and degrade the capability for further anomaly detection. Thus, this paper suggests a general CBA framework for detecting group-based market manipulation by capturing more comprehensive couplings and proposes two variant implementations, which are hybrid coupling (HC)-based and hierarchical grouping (HG)-based respectively. The proposed framework consists of three stages. The first stage, qualitative analysis, generates possible qualitative coupling relationships between behaviors with or without domain knowledge. In the second stage, quantitative representation of coupled behaviors is learned via proper methods. For the third stage, anomaly detection algorithms are proposed to cater for different application scenarios. Experimental results on data from a major Asian stock market show that the proposed framework outperforms the CHMM-based analysis in terms of detecting abnormal collaborative market manipulations. Additionally, the two different implementations are compared with their effectiveness for different application scenarios.

【Keywords】: anomaly detection; coupled behavior analysis; coupled hidden markov model; hierarchical clustering; market manipulation; relational learning

124. HySAD: a semi-supervised hybrid shilling attack detector for trustworthy product recommendation.

Paper Link】 【Pages】:985-993

【Authors】: Zhiang Wu ; Junjie Wu ; Jie Cao ; Dacheng Tao

【Abstract】: Shilling attackers apply biased rating profiles to recommender systems for manipulating online product recommendations. Although many studies have been devoted to shilling attack detection, few of them can handle the hybrid shilling attacks that usually happen in practice, and the studies for real-life applications are rarely seen. Moreover, little attention has yet been paid to modeling both labeled and unlabeled user profiles, although there are often a few labeled but numerous unlabeled users available in practice. This paper presents a Hybrid Shilling Attack Detector, or HySAD for short, to tackle these problems. In particular, HySAD introduces MC-Relief to select effective detection metrics, and Semi-supervised Naive Bayes (SNB_lambda) to precisely separate Random-Filler model attackers and Average-Filler model attackers from normal users. Thorough experiments on MovieLens and Netflix datasets demonstrate the effectiveness of HySAD in detecting hybrid shilling attacks, and its robustness for various obfuscated strategies. A real-life case study on product reviews of Amazon.cn is also provided, which further demonstrates that HySAD can effectively improve the accuracy of a collaborative-filtering based recommender system, and provide interesting opportunities for in-depth analysis of attacker behaviors. These, in turn, justify the value of HySAD for real-world applications.

【Keywords】: hybrid shilling attack; naive bayes; recommendation systems; semi-supervised learning

125. Following the electrons: methods for power management in commercial buildings.

Paper Link】 【Pages】:994-1002

【Authors】: Gowtham Bellala ; Manish Marwah ; Martin F. Arlitt ; Geoff Lyon ; Cullen Bash

【Abstract】: Commercial buildings are significant consumers of electricity. The first step towards better energy management in commercial buildings is monitoring consumption. However, instrumenting every electrical panel in a large commercial building is expensive and wasteful. In this paper, we propose a greedy meter (sensor) placement algorithm based on maximization of information gained, subject to a cost constraint. The algorithm provides a near-optimal solution guarantee. Furthermore, to identify power saving opportunities, we use an unsupervised anomaly detection technique based on a low-dimensional embedding. Further, to better manage resources such as lighting and HVAC, we propose a semi-supervised approach combining hidden Markov models (HMM) and a standard classifier to model occupancy based on readily available port-level network statistics.

【Keywords】: anomaly detection; commercial buildings; meter placement; occupancy modeling; power management

Industrial practice expo B6: session 3 2

126. Ensembles and model delivery for tax compliance.

Paper Link】 【Pages】:1003

【Authors】: Graham Williams

【Abstract】: Revenue authorities characteristically have large stores of historic audit data, with outcomes, ready for analysis. The Australian Taxation Office established one of the largest data mining teams in Australia in 2004 as a foundation to becoming a knowledge-based organization. Today, every tax return lodged in Australia is risk assessed by one or more models developed through data mining, generally based on historic data. We observe that any of the traditional modeling approaches, particularly including random forests, generally deliver similar models in terms of accuracy. We take advantage of combining different model types and modeling approaches for risk scoring, and in particular report on recent research that increases the diversity of trees that make up a random forest. We also review, in a practical context, how such models are evaluated and delivered.

【Keywords】: ensemble models; model evaluation; random forests; risk scoring

127. Leveraging predictive modeling to reduce signal theft in a multi-service organization environment.

Paper Link】 【Pages】:1004

【Authors】: Seymour Douglas

【Abstract】: Signal theft can be defined as the interdiction, consumption or usage of carrier signal from a provider's network without payment or payment of an amount less than the level of service consumed. High levels of signal theft can potentially reflect open technical network issues, failure of electronic countermeasures or operational gaps that are estimated to cost the cable industry providers more than $5 billion annually. This session will discuss the business challenges associated with the quantification of signal theft-related losses, outline some of the countermeasures taken by MSOs, and then provide views on the development of predictive models to help identify the potential likelihood of signal theft in a given environment. We will examine the performance of certain machine learning algorithms as well as data challenges associated with both the architecture construction and analytical efforts, and conclude with a lessons-learned discussion and views on future approaches.

【Keywords】: data analytics; signal-theft; telecommunications

Research session c1: team, trends, and social profiling 4

128. Capacitated team formation problem on social networks.

Paper Link】 【Pages】:1005-1013

【Authors】: Anirban Majumder ; Samik Datta ; K. V. M. Naidu

【Abstract】: In a team formation problem, one is required to find a group of users that can match the requirements of a collaborative task. Example of such collaborative tasks abound, ranging from software product development to various participatory sensing tasks in knowledge creation. Due to the nature of the task, team members are often required to work on a co-operative basis. Previous studies [1, 2] have indicated that co-operation becomes effective in presence of social connections. Therefore, effective team selection requires the team members to be socially close as well as a division of the task among team members so that no user is overloaded by the assignment. In this work, we investigate how such teams can be formed on a social network. Since our team formation problems are proven to be NP-hard, we design efficient approximate algorithms for finding near optimum teams with provable guarantees. As traditional data-sets from on-line social networks (e.g. Twitter, Facebook etc) typically do not contain instances of large scale collaboration, we have crawled millions of software repositories spanning a period of four years and hundreds of thousands of developers from GitHub, a popular open-source social coding network. We perform large scale experiments on this data-set to evaluate the accuracy and efficiency of our algorithms. Experimental results suggest that our algorithms achieve significant improvement in finding effective teams, as compared to naive strategies and scale well with the size of the data. Finally, we provide a validation of our techniques by comparing with existing software teams.

【Keywords】: capacity; social networks; team formation

129. Finding trendsetters in information networks.

Paper Link】 【Pages】:1014-1022

【Authors】: Diego Sáez-Trumper ; Giovanni Comarela ; Virgílio A. F. Almeida ; Ricardo A. Baeza-Yates ; Fabrício Benevenuto

【Abstract】: Influential people have an important role in the process of information diffusion. However, there are several ways to be influential, for example, to be the most popular or the first that adopts a new idea. In this paper we present a methodology to find trendsetters in information networks according to a specific topic of interest. Trendsetters are people that adopt and spread new ideas influencing other people before these ideas become popular. At the same time, not all early adopters are trendsetters because only few of them have the ability of propagating their ideas by their social contacts through word-of-mouth. Differently from other influence measures, a trendsetter is not necessarily popular or famous, but the one whose ideas spread over the graph successfully. Other metrics such as node in-degree or even standard Pagerank focus only in the static topology of the network. We propose a ranking strategy that focuses on the ability of some users to push new ideas that will be successful in the future. To that end, we combine temporal attributes of nodes and edges of the network with a Pagerank based algorithm to find the trendsetters for a given topic. To test our algorithm we conduct innovative experiments over a large Twitter dataset. We show that nodes with high in-degree tend to arrive late for new trends, while users in the top of our ranking tend to be early adopters that also influence their social contacts to adopt the new trend.

【Keywords】: influence; information networks; social networks

130. Towards social user profiling: unified and discriminative influence model for inferring home locations.

Paper Link】 【Pages】:1023-1031

【Authors】: Rui Li ; Shengjie Wang ; Hongbo Deng ; Rui Wang ; Kevin Chen-Chuan Chang

【Abstract】: Users' locations are important to many applications such as targeted advertisement and news recommendation. In this paper, we focus on the problem of profiling users' home locations in the context of social network (Twitter). The problem is nontrivial, because signals, which may help to identify a user's location, are scarce and noisy. We propose a unified discriminative influence model, named as UDI, to solve the problem. To overcome the challenge of scarce signals, UDI integrates signals observed from both social network (friends) and user-centric data (tweets) in a unified probabilistic framework. To overcome the challenge of noisy signals, UDI captures how likely a user connects to a signal with respect to 1) the distance between the user and the signal, and 2) the influence scope of the signal. Based on the model, we develop local and global location prediction methods. The experiments on a large scale data set show that our methods improve the state-of-the-art methods by 13%, and achieve the best performance.

【Keywords】: influence model; location profiling; social network

131. Event-based social networks: linking the online and offline social worlds.

Paper Link】 【Pages】:1032-1040

【Authors】: Xingjie Liu ; Qi He ; Yuanyuan Tian ; Wang-Chien Lee ; John McPherson ; Jiawei Han

【Abstract】: Newly emerged event-based online social services, such as Meetup and Plancast, have experienced increased popularity and rapid growth. From these services, we observed a new type of social network - event-based social network (EBSN). An EBSN does not only contain online social interactions as in other conventional online social networks, but also includes valuable offline social interactions captured in offline activities. By analyzing real data collected from Meetup, we investigated EBSN properties and discovered many unique and interesting characteristics, such as heavy-tailed degree distributions and strong locality of social interactions. We subsequently studied the heterogeneous nature (co-existence of both online and offline social interactions) of EBSNs on two challenging problems: community detection and information flow. We found that communities detected in EBSNs are more cohesive than those in other types of social networks (e.g. location-based social networks). In the context of information flow, we studied the event recommendation problem. By experimenting various information diffusion patterns, we found that a community-based diffusion model that takes into account of both online and offline interactions provides the best prediction power. This paper is the first research to study EBSNs at scale and paves the way for future studies on this new type of social network. A sample dataset of this study can be downloaded from http://www.largenetwork.org/ebsn.

【Keywords】: event based social networks; heterogeneous network; online and offline social behaviors; social event recommendation; social network analysis

Research session c2: privacy 3

132. Differential identifiability.

Paper Link】 【Pages】:1041-1049

【Authors】: Jaewoo Lee ; Chris Clifton

【Abstract】: A key challenge in privacy-preserving data mining is ensuring that a data mining result does not inherently violate privacy. ε-Differential Privacy appears to provide a solution to this problem. However, there are no clear guidelines on how to set ε to satisfy a privacy policy. We give an alternate formulation, Differential Identifiability, parameterized by the probability of individual identification. This provides the strong privacy guarantees of differential privacy, while letting policy makers set parameters based on the established privacy concept of individual identifiability.

【Keywords】: differential privacy; identifiability

133. Anonymizing set-valued data by nonreciprocal recoding.

Paper Link】 【Pages】:1050-1058

【Authors】: Mingqiang Xue ; Panagiotis Karras ; Chedy Raïssi ; Jaideep Vaidya ; Kian-Lee Tan

【Abstract】: Today there is a strong interest in publishing set-valued data in a privacy-preserving manner. Such data associate individuals to sets of values (e.g., preferences, shopping items, symptoms, query logs). In addition, an individual can be associated with a sensitive label (e.g., marital status, religious or political conviction). Anonymizing such data implies ensuring that an adversary should not be able to (1) identify an individual's record, and (2) infer a sensitive label, if such exists. Existing research on this problem either perturbs the data, publishes them in disjoint groups disassociated from their sensitive labels, or generalizes their values by assuming the availability of a generalization hierarchy. In this paper, we propose a novel alternative. Our publication method also puts data in a generalized form, but does not require that published records form disjoint groups and does not assume a hierarchy either; instead, it employs generalized bitmaps and recasts data values in a nonreciprocal manner; formally, the bipartite graph from original to anonymized records does not have to be composed of disjoint complete subgraphs. We configure our schemes to provide popular privacy guarantees while resisting attacks proposed in recent research, and demonstrate experimentally that we gain a clear utility advantage over the previous state of the art.

【Keywords】: anonymization; nonreciprocal recoding; privacy; set-valued data

134. Adversarial support vector machine learning.

Paper Link】 【Pages】:1059-1067

【Authors】: Yan Zhou ; Murat Kantarcioglu ; Bhavani M. Thuraisingham ; Bowei Xi

【Abstract】: Many learning tasks such as spam filtering and credit card fraud detection face an active adversary that tries to avoid detection. For learning problems that deal with an active adversary, it is important to model the adversary's attack strategy and develop robust learning models to mitigate the attack. These are the two objectives of this paper. We consider two attack models: a free-range attack model that permits arbitrary data corruption and a restrained attack model that anticipates more realistic attacks that a reasonable adversary would devise under penalties. We then develop optimal SVM learning strategies against the two attack models. The learning algorithms minimize the hinge loss while assuming the adversary is modifying data to maximize the loss. Experiments are performed on both artificial and real data sets. We demonstrate that optimal solutions may be overly pessimistic when the actual attacks are much weaker than expected. More important, we demonstrate that it is possible to develop a much more resilient SVM learning model while making loose assumptions on the data corruption models. When derived under the restrained attack model, our optimal SVM learning strategy provides more robust overall performance under a wide range of attack parameters.

【Keywords】: adversarial learning; attack models; robust SVM

Research session c3: supervised learning applications 4

135. Web image prediction using multivariate point processes.

Paper Link】 【Pages】:1068-1076

【Authors】: Gunhee Kim ; Fei-Fei Li ; Eric P. Xing

【Abstract】: In this paper, we investigate a problem of predicting what images are likely to appear on the Web at a future time point, given a query word and a database of historical image streams that potentiates learning of uploading patterns of previous user images and associated metadata. We address such a Web image prediction problem at both a collective group level and an individual user level. We develop a predictive framework based on the multivariate point process, which employs a stochastic parametric model to solve the relations between image occurrence and the covariates that influence it, in a flexible, scalable, and globally optimal way. Using Flickr datasets of more than ten million images of 40 topics, our empirical results show that the proposed algorithm is more successful in predicting unseen Web images than other candidate methods, including forecasting on semantic meanings only, a PageRank-based image retrieval, and a generative author-time topic model.

【Keywords】: multivariate point processes; penalized poisson regression; personalization; web image prediction

136. Transductive multi-label ensemble classification for protein function prediction.

Paper Link】 【Pages】:1077-1085

【Authors】: Guo-Xian Yu ; Carlotta Domeniconi ; Huzefa Rangwala ; Guoji Zhang ; Zhiwen Yu

【Abstract】: Advances in biotechnology have made available multitudes of heterogeneous proteomic and genomic data. Integrating these heterogeneous data sources, to automatically infer the function of proteins, is a fundamental challenge in computational biology. Several approaches represent each data source with a kernel (similarity) function. The resulting kernels are then integrated to determine a composite kernel, which is used for developing a function prediction model. Proteins are also found to have multiple roles and functions. As such, several approaches cast the protein function prediction problem within a multi-label learning framework. In our work we develop an approach that takes advantage of several unlabeled proteins, along with multiple data sources and multiple functions of proteins. We develop a graph-based transductive multi-label classifier (TMC) that is evaluated on a composite kernel, and also propose a method for data integration using the ensemble framework, called transductive multi-label ensemble classifier (TMEC). The TMEC approach trains a graph-based multi-label classifier for each individual kernel, and then combines the predictions of the individual models. Our contribution is the use of a bi-relational directed graph that captures relationships between pairs of proteins, between pairs of functions, and between proteins and functions. We evaluate the ability of TMC and TMEC to predict the functions of proteins by using two yeast datasets. We show that our approach performs better than recently proposed protein function prediction methods on composite and multiple kernels.

【Keywords】: directed bi-relation graph; multi-label ensemble classifier; protein function prediction

137. Multi-domain active learning for text classification.

Paper Link】 【Pages】:1086-1094

【Authors】: Lianghao Li ; Xiaoming Jin ; Sinno Jialin Pan ; Jian-Tao Sun

【Abstract】: Active learning has been proven to be effective in reducing labeling efforts for supervised learning. However, existing active learning work has mainly focused on training models for a single domain. In practical applications, it is common to simultaneously train classifiers for multiple domains. For example, some merchant web sites (like Amazon.com) may need a set of classifiers to predict the sentiment polarity of product reviews collected from various domains (e.g., electronics, books, shoes). Though different domains have their own unique features, they may share some common latent features. If we apply active learning on each domain separately, some data instances selected from different domains may contain duplicate knowledge due to the common features. Therefore, how to choose the data from multiple domains to label is crucial to further reducing the human labeling efforts in multi-domain learning. In this paper, we propose a novel multi-domain active learning framework to jointly select data instances from all domains with duplicate information considered. In our solution, a shared subspace is first learned to represent common latent features of different domains. By considering the common and the domain-specific features together, the model loss reduction induced by each data instance can be decomposed into a common part and a domain-specific part. In this way, the duplicate information across domains can be encoded into the common part of model loss reduction and taken into account when querying. We compare our method with the state-of-the-art active learning approaches on several text classification tasks: sentiment classification, newsgroup classification and email spam filtering. The experiment results show that our method reduces the human labeling efforts by 33.2%, 42.9% and 68.7% on the three tasks, respectively.

【Keywords】: active learning; text classification; transfer learning

138. Modeling disease progression via fused sparse group lasso.

Paper Link】 【Pages】:1095-1103

【Authors】: Jiayu Zhou ; Jun Liu ; Vaibhav A. Narayan ; Jieping Ye

【Abstract】: Alzheimer's Disease (AD) is the most common neurodegenerative disorder associated with aging. Understanding how the disease progresses and identifying related pathological biomarkers for the progression is of primary importance in Alzheimer's disease research. In this paper, we develop novel multi-task learning techniques to predict the disease progression measured by cognitive scores and select biomarkers predictive of the progression. In multi-task learning, the prediction of cognitive scores at each time point is considered as a task, and multiple prediction tasks at different time points are performed simultaneously to capture the temporal smoothness of the prediction models across different time points. Specifically, we propose a novel convex fused sparse group Lasso (cFSGL) formulation that allows the simultaneous selection of a common set of biomarkers for multiple time points and specific sets of biomarkers for different time points using the sparse group Lasso penalty and in the meantime incorporates the temporal smoothness using the fused Lasso penalty. The proposed formulation is challenging to solve due to the use of several non-smooth penalties. We show that the proximal operator associated with the proposed formulation exhibits a certain decomposition property and can be computed efficiently; thus cFSGL can be solved efficiently using the accelerated gradient method. To further improve the model, we propose two non-convex formulations to reduce the shrinkage bias inherent in the convex formulation. We employ the difference of convex programming technique to solve the non-convex formulations. Our extensive experiments using data from the Alzheimer's Disease Neuroimaging Initiative demonstrate the effectiveness of the proposed progression models in comparison with existing methods for disease progression. We also perform longitudinal stability selection to identify and analyze the temporal patterns of biomarkers in disease progression.

【Keywords】: alzheimer's disease; cognitive score; fused lasso; multi-task learning; regression; sparse group lasso

Research session c4: information extraction 4

139. Open domain event extraction from twitter.

Paper Link】 【Pages】:1104-1112

【Authors】: Alan Ritter ; Mausam ; Oren Etzioni ; Sam Clark

【Abstract】: Tweets are the most up-to-date and inclusive stream of in- formation and commentary on current events, but they are also fragmented and noisy, motivating the need for systems that can extract, aggregate and categorize important events. Previous work on extracting structured representations of events has focused largely on newswire text; Twitter's unique characteristics present new challenges and opportunities for open-domain event extraction. This paper describes TwiCal-- the first open-domain event-extraction and categorization system for Twitter. We demonstrate that accurately extracting an open-domain calendar of significant events from Twitter is indeed feasible. In addition, we present a novel approach for discovering important event categories and classifying extracted events based on latent variable models. By leveraging large volumes of unlabeled data, our approach achieves a 14% increase in maximum F1 over a supervised baseline. A continuously updating demonstration of our system can be viewed at http://statuscalendar.com; Our NLP tools are available at http://github.com/aritter/ twitter_nlp.

【Keywords】: information extraction; social media

140. Stratified k-means clustering over a deep web data source.

Paper Link】 【Pages】:1113-1121

【Authors】: Tantan Liu ; Gagan Agrawal

【Abstract】: This paper focuses on the problem of clustering data from a {\em hidden} or a deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. We have developed a new stratified clustering method addressing this problem for a deep web data source. Specifically, we have developed a stratified k-means clustering method. In our approach, the space of input attributes of a deep web data source is stratified for capturing the relationship between the input and the output attributes. The space of output attributes of a deep web data source is partitioned into sub-spaces. Three representative sampling methods are developed in this paper, with the goal of achieving a good estimation of the statistics, including proportions and centers, within the sub-spaces of the output attributes. We have evaluated our methods using two synthetic and two real datasets. Our comparison shows significant gains in estimation accuracy from both the novel aspects of our work, i.e., the use of stratification(5%-55%), and our and representative sampling methods(up to 54%).

【Keywords】: clustering; deep web; sampling

141. Metro maps of science.

Paper Link】 【Pages】:1122-1130

【Authors】: Dafna Shahaf ; Carlos Guestrin ; Eric Horvitz

【Abstract】: As the number of scientific publications soars, even the most enthusiastic reader can have trouble staying on top of the evolving literature. It is easy to focus on a narrow aspect of one's field and lose track of the big picture. Information overload is indeed a major challenge for scientists today, and is especially daunting for new investigators attempting to master a discipline and scientists who seek to cross disciplinary borders. In this paper, we propose metrics of influence, coverage and connectivity for scientific literature. We use these metrics to create structured summaries of information, which we call metro maps. Most importantly, metro maps explicitly show the relations between papers in a way which captures developments in the field. Pilot user studies demonstrate that our method helps researchers acquire new knowledge efficiently: map users achieved better precision and recall scores and found more seminal papers while performing fewer searches.

【Keywords】: information; metro maps; summarization

142. Active sampling for entity matching.

Paper Link】 【Pages】:1131-1139

【Authors】: Kedar Bellare ; Suresh Iyengar ; Aditya G. Parameswaran ; Vibhor Rastogi

【Abstract】: In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or non-duplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more non-duplicate pairs than duplicate pairs). To address this, a recent paper [1] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst-case. Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sub-linear label complexity (under certain distributional assumptions). Our algorithm uses as a black-box any active learning module that minimizes 0-1 loss. We show that label complexity of our algorithm is at most log n times the label complexity of the black-box, and also bound the difference in the recall of classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.

【Keywords】: active learning; deduplication; entity matching; imbalanced data

Industry/govt track c5: medical informatics 4

143. An integrated data mining approach to real-time clinical monitoring and deterioration warning.

Paper Link】 【Pages】:1140-1148

【Authors】: Yi Mao ; Wenlin Chen ; Yixin Chen ; Chenyang Lu ; Marin Kollef ; Thomas C. Bailey

【Abstract】: Clinical study found that early detection and intervention are essential for preventing clinical deterioration in patients, for patients both in intensive care units (ICU) as well as in general wards but under real-time data sensing (RDS). In this paper, we develop an integrated data mining approach to give early deterioration warnings for patients under real-time monitoring in ICU and RDS. Existing work on mining real-time clinical data often focus on certain single vital sign and specific disease. In this paper, we consider an integrated data mining approach for general sudden deterioration warning. We synthesize a large feature set that includes first and second order time-series features, detrended fluctuation analysis (DFA), spectral analysis, approximative entropy, and cross-signal features. We then systematically apply and evaluate a series of established data mining methods, including forward feature selection, linear and nonlinear classification algorithms, and exploratory undersampling for class imbalance. An extensive empirical study is conducted on real patient data collected between 2001 and 2008 from a variety of ICUs. Results show the benefit of each of the proposed techniques, and the final integrated approach significantly improves the prediction quality. The proposed clinical warning system is currently under integration with the electronic medical record system at Barnes-Jewish Hospital in preparation for a clinical trial. This work represents a promising step toward general early clinical warning which has the potential to significantly improve the quality of patient care in hospitals.

【Keywords】: deterioration warning; feature selection; real-time clinical monitoring; time-series classification

144. Multi-source learning for joint analysis of incomplete multi-modality neuroimaging data.

Paper Link】 【Pages】:1149-1157

【Authors】: Lei Yuan ; Yalin Wang ; Paul M. Thompson ; Vaibhav A. Narayan ; Jieping Ye

【Abstract】: Incomplete data present serious problems when integrating large-scale brain imaging data sets from different imaging modalities. In the Alzheimer's Disease Neuroimaging Initiative (ADNI), for example, over half of the subjects lack cerebrospinal fluid (CSF) measurements; an independent half of the subjects do not have fluorodeoxyglucose positron emission tomography (FDG-PET) scans; many lack proteomics measurements. Traditionally, subjects with missing measures are discarded, resulting in a severe loss of available information. We address this problem by proposing two novel learning methods where all the samples (with at least one available data source) can be used. In the first method, we divide our samples according to the availability of data sources, and we learn shared sets of features with state-of-the-art sparse learning methods. Our second method learns a base classifier for each data source independently, based on which we represent each source using a single column of prediction scores; we then estimate the missing prediction scores, which, combined with the existing prediction scores, are used to build a multi-source fusion model. To illustrate the proposed approaches, we classify patients from the ADNI study into groups with Alzheimer's disease (AD), mild cognitive impairment (MCI) and normal controls, based on the multi-modality data. At baseline, ADNI's 780 participants (172 AD, 397 MCI, 211 Normal), have at least one of four data types: magnetic resonance imaging (MRI), FDG-PET, CSF and proteomics. These data are used to test our algorithms. Comprehensive experiments show that our proposed methods yield stable and promising results.

【Keywords】: incomplete data; multi-source feature learning; multi-task learning

145. RainMon: an integrated approach to mining bursty timeseries monitoring data.

Paper Link】 【Pages】:1158-1166

【Authors】: Ilari Shafer ; Kai Ren ; Vishnu Naresh Boddeti ; Yoshihisa Abe ; Gregory R. Ganger ; Christos Faloutsos

【Abstract】: Metrics like disk activity and network traffic are widespread sources of diagnosis and monitoring information in datacenters and networks. However, as the scale of these systems increases, examining the raw data yields diminishing insight. We present RainMon, a novel end-to-end approach for mining timeseries monitoring data designed to handle its size and unique characteristics. Our system is able to (a) mine large, bursty, real-world monitoring data, (b) find significant trends and anomalies in the data, (c) compress the raw data effectively, and (d) estimate trends to make forecasts. Furthermore, RainMon integrates the full analysis process from data storage to the user interface to provide accessible long-term diagnosis. We apply RainMon to three real-world datasets from production systems and show its utility in discovering anomalous machines and time periods.

【Keywords】: PCA; bursty data; system monitoring

146. SympGraph: a framework for mining clinical notes through symptom relation graphs.

Paper Link】 【Pages】:1167-1175

【Authors】: Parikshit Sondhi ; Jimeng Sun ; Hanghang Tong ; ChengXiang Zhai

【Abstract】: As an integral part of Electronic Health Records (EHRs), clinical notes pose special challenges for analyzing EHRs due to their unstructured nature. In this paper, we present a general mining framework SympGraph for modeling and analyzing symptom relationships in clinical notes. A SympGraph has symptoms as nodes and co-occurrence relations between symptoms as edges, and can be constructed automatically through extracting symptoms over sequences of clinical notes for a large number of patients. We present an important clinical application of SympGraph: symptom expansion, which can expand a given set of symptoms to other related symptoms by analyzing the underlying SympGraph structure. We further propose a matrix update algorithm which provides a significant computational saving for dynamic updates to the graph. Comprehensive evaluation on 1 million longitudinal clinical notes over 13K patients shows that static symptom expansion can successfully expand a set of known symptoms to a disease with high agreement rate with physician input (average precision 0.46), a 31% improvement over baseline co-occurrence based methods. The experimental results also show that the expanded symptoms can serve as useful features for improving AUC measure for disease diagnosis prediction, thus confirming the potential clinical value of our work.

【Keywords】: patient records; physician notes; random walk; symptom graphs

Industrial practice expo c6: session 4 1

147. Experiences and lessons in developing industry-strength machine learning and data mining software.

Paper Link】 【Pages】:1176

【Authors】: Chih-Jen Lin

【Abstract】: Traditionally academic machine learning and data mining researchers focus on proposing new algorithms. The task of implementing these methods is often left to companies that are developing software packages. However, the gap between the two sides has caused some problems. First, the practical deployment of new algorithms still involves some challenging issues that need to be studied by researchers. Second, without further investigation after publishing their papers, researchers have neither the opportunity to work with real problems nor see how their methods are used. We discuss the experiences in developing two machine learning packages LIBSVM and LIBLINEAR, that are widely used in both academia and industry. We demonstrate that the interaction with users leads us to identify some important research problems. For example, the decision to study and then support multi-class SVM was essential in the early stage of developing LIBSVM. The birth of LIBLINEAR was driven by the need to classify large-scale documents in Internet companies. For fast training of large-scale problems, we had to create new algorithms other than those used in LIBSVM for kernel SVM. We present some practical use of LIBLINEAR for Internet applications. Finally, we give lessons learned and future perspectives for developing industry-strength machine learning and data mining software.

【Keywords】: document classification; machine learning; software development; support vector machines

Research session a1: ads and video recommendation 5

Paper Link】 【Pages】:1177-1185

【Authors】: Weinan Zhang ; Ying Zhang ; Bin Gao ; Yong Yu ; Xiaojie Yuan ; Tie-Yan Liu

【Abstract】: This paper is concerned with the joint allocation of bid price and campaign budget in sponsored search. In this application, an advertiser can create a number of campaigns and set a budget for each of them. In a campaign, he/she can further create several ad groups with bid keywords and bid prices. Data analysis shows that many advertisers are dealing with a very large number of campaigns, bid keywords, and bid prices at the same time, which poses a great challenge to the optimality of their campaign management. As a result, the budgets of some campaigns might be too low to achieve the desired performance goals while those of some other campaigns might be wasted; the bid prices for some keywords may be too low to win competitive auctions while those of some other keywords may be unnecessarily high. In this paper, we propose a novel algorithm to automatically address this issue. In particular, we model the problem as a constrained optimization problem, which maximizes the expected advertiser revenue subject to the constraints of the total budget of the advertiser and the ranges of bid price change. By solving this optimization problem, we can obtain an optimal budget allocation plan as well as an optimal bid price setting. Our simulation results based on the sponsored search log of a commercial search engine have shown that by employing the proposed method, we can effectively improve the performances of the advertisers while at the same time we also see an increase in the revenue of the search engine. In addition, the results indicate that this method is robust to the second-order effects caused by the bid fluctuations from other advertisers.

【Keywords】: bid optimization; budget allocation; sponsored search

149. The untold story of the clones: content-agnostic factors that impact YouTube video popularity.

Paper Link】 【Pages】:1186-1194

【Authors】: Youmna Borghol ; Sebastien Ardon ; Niklas Carlsson ; Derek L. Eager ; Anirban Mahanti

【Abstract】: Video dissemination through sites such as YouTube can have widespread impacts on opinions, thoughts, and cultures. Not all videos will reach the same popularity and have the same impact. Popularity differences arise not only because of differences in video content, but also because of other "content-agnostic" factors. The latter factors are of considerable interest but it has been difficult to accurately study them. For example, videos uploaded by users with large social networks may tend to be more popular because they tend to have more interesting content, not because social network size has a substantial direct impact on popularity. In this paper, we develop and apply a methodology that is able to accurately assess, both qualitatively and quantitatively, the impacts of various content-agnostic factors on video popularity. When controlling for video content, we observe a strong linear "rich-get-richer" behavior, with the total number of previous views as the most important factor except for very young videos. The second most important factor is found to be video age. We analyze a number of phenomena that may contribute to rich-get-richer, including the first-mover advantage, and search bias towards popular videos. For young videos we find that factors other than the total number of previous views, such as uploader characteristics and number of keywords, become relatively more important. Our findings also confirm that inaccurate conclusions can be reached when not controlling for content.

【Keywords】: clones; content popularity; rich-get-richer; youtube

150. SHALE: an efficient algorithm for allocation of guaranteed display advertising.

Paper Link】 【Pages】:1195-1203

【Authors】: Vijay Bharadwaj ; Peiji Chen ; Wenjing Ma ; Chandrashekhar Nagarajan ; John Tomlin ; Sergei Vassilvitskii ; Erik Vee ; Jian Yang

【Abstract】: Motivated by the problem of optimizing allocation in guaranteed display advertising, we develop an efficient, lightweight method of generating a compact allocation plan that can be used to guide ad server decisions. The plan itself uses just O(1) state per guaranteed contract, is robust to noise, and allows us to serve (provably) nearly optimally. The optimization method we develop is scalable, with a small in-memory footprint, and working in linear time per iteration. It is also "stop-anytime", meaning that time-critical applications can stop early and still get a good serving solution. Thus, it is particularly useful for optimizing the large problems arising in the context of display advertising. We demonstrate the effectiveness of our algorithm using actual Yahoo! data.

【Keywords】: display advertising; online advertising

151. Factoring past exposure in display advertising targeting.

Paper Link】 【Pages】:1204-1212

【Authors】: Neha Gupta ; Abhimanyu Das ; Sandeep Pandey ; Vijay K. Narayanan

【Abstract】: Online advertising is becoming more and more performance oriented where the decision to show an advertisement to a user is made based on the user's propensity to respond to the ad in a positive manner, (e.g., purchasing a product, subscribing to an email list). The user response depends on how well the ad campaign matches to the user's interest, as well as the amount of user's past exposure to the campaign - a factor shown to be impactful in controlled experimental studies. Past exposure builds brand-awareness and familiarity with the user, which in turn leads to a higher propensity of the user to buy/convert on the ad impression. In this paper we propose a model of the user response to an ad campaign as a function of both the interest match and the past exposure, where the interest match is estimated using historical search/browse activities of the user. The goal of this paper is two-fold. First, we demonstrate the role played by the user interest and the past exposure in modeling user response by jointly estimating the parameters of these factors. We test this response model over hundreds of real ad campaigns. Second, we use the findings from this joint model to identify more relevant target users for ad campaigns. In particular, we show that on real advertising data this model combines past exposure together with the user profile to identify better target users over the conventional targeting models.

【Keywords】: advertising; latent factors; targeting; user modeling

152. Online allocation of display ads with smooth delivery.

Paper Link】 【Pages】:1213-1221

【Authors】: Anand Bhalgat ; Jon Feldman ; Vahab S. Mirrokni

【Abstract】: Display ads on the Internet are often sold in bundles of thousands or millions of impressions over a particular time period, typically weeks or months. Ad serving systems that assign ads to pages on behalf of publishers must satisfy these contracts, but at the same time try to maximize overall quality of placement. This is usually modeled in the literature as an online allocation problem, where contracts are represented by overall delivery constraints over a finite time horizon. However this model misses an important aspect of ad delivery: time homogeneity. Advertisers who buy these packages expect their ad to be shown smoothly throughout the purchased time period, in order to reach a wider audience, to have a sustained impact, and to support the ads they are running on other media (e.g., television). In this paper we formalize this problem using several nested packing constraints, and develop a tight (1-1/e)-competitive online algorithm for this problem. Our algorithms and analysis require novel techniques as they involve online computation of multiple dual variables per ad. We then show the effectiveness of our algorithms through exhaustive simulation studies on real data sets.

【Keywords】: ad allocation; display ads; online matching; smooth delivery

Research session a2: graph mining 5

153. Streaming graph partitioning for large distributed graphs.

Paper Link】 【Pages】:1222-1230

【Authors】: Isabelle Stanton ; Gabriel Kliot

【Abstract】: Extracting knowledge by performing computations on graphs is becoming increasingly challenging as graphs grow in size. A standard approach distributes the graph over a cluster of nodes, but performing computations on a distributed graph is expensive if large amount of data have to be moved. Without partitioning the graph, communication quickly becomes a limiting factor in scaling the system up. Existing graph partitioning heuristics incur high computation and communication cost on large graphs, sometimes as high as the future computation itself. Observing that the graph has to be loaded into the cluster, we ask if the partitioning can be done at the same time with a lightweight streaming algorithm. We propose natural, simple heuristics and compare their performance to hashing and METIS, a fast, offline heuristic. We show on a large collection of graph datasets that our heuristics are a significant improvement, with the best obtaining an average gain of 76%. The heuristics are scalable in the size of the graphs and the number of partitions. Using our streaming partitioning methods, we are able to speed up PageRank computations on Spark, a distributed computation system, by 18% to 39% for large social networks.

【Keywords】: distributed graphs; experimental evaluation; graph partitioning; streaming algorithms

154. RolX: structural role extraction & mining in large graphs.

Paper Link】 【Pages】:1231-1239

【Authors】: Keith Henderson ; Brian Gallagher ; Tina Eliassi-Rad ; Hanghang Tong ; Sugato Basu ; Leman Akoglu ; Danai Koutra ; Christos Faloutsos ; Lei Li

【Abstract】: Given a network, intuitively two nodes belong to the same role if they have similar structural behavior. Roles should be automatically determined from the data, and could be, for example, "clique-members," "periphery-nodes," etc. Roles enable numerous novel and useful network-mining tasks, such as sense-making, searching for similar nodes, and node classification. This paper addresses the question: Given a graph, how can we automatically discover roles for nodes? We propose RolX (Role eXtraction), a scalable (linear in the number of edges), unsupervised learning approach for automatically extracting structural roles from general network data. We demonstrate the effectiveness of RolX on several network-mining tasks: from exploratory data analysis to network transfer learning. Moreover, we compare network role discovery with network community discovery. We highlight fundamental differences between the two (e.g., roles generalize across disconnected networks, communities do not); and show that the two approaches are complimentary in nature.

【Keywords】: graph mining; network classification; sense-making; similarity search; structural role discovery

155. Fast algorithms for maximal clique enumeration with limited memory.

Paper Link】 【Pages】:1240-1248

【Authors】: James Cheng ; Linhong Zhu ; Yiping Ke ; Shumo Chu

【Abstract】: Maximal clique enumeration (MCE) is a long-standing problem in graph theory and has numerous important applications. Though extensively studied, most existing algorithms become impractical when the input graph is too large and is disk-resident. We first propose an efficient partition-based algorithm for MCE that addresses the problem of processing large graphs with limited memory. We then further reduce the high cost of CPU computation of MCE by a careful nested partition based on a cost model. Finally, we parallelize our algorithm to further reduce the overall running time. We verified the efficiency of our algorithms by experiments in large real-world graphs.

【Keywords】: i/o efficient; massive networks; maximal clique enumeration; parallel algorithm; sparse graphs

156. Summarization-based mining bipartite graphs.

Paper Link】 【Pages】:1249-1257

【Authors】: Jing Feng ; Xiao He ; Bettina Konte ; Christian Böhm ; Claudia Plant

【Abstract】: How to extract the truly relevant information from a large relational data set? The answer of this paper is a technique integrating graph summarization, graph clustering, link prediction and the discovery of the hidden structure on the basis of data compression. Our novel algorithm SCMiner (for Summarization-Compression Miner) reduces a large bipartite input graph to a highly compact representation which is very useful for different data mining tasks: 1) Clustering: The compact summary graph contains the truly relevant clusters of both types of nodes of a bipartite graph. 2) Link prediction: The compression scheme of SCMiner reveals suspicious edges which are probably erroneous as well as missing edges, i.e. pairs of nodes which should be connected by an edge. 3) Discovery of the hidden structure: Unlike traditional co-clustering methods, the result of SCMiner is not limited to row- and column-clusters. Besides the clusters, the summary graph also contains the essential relationships between both types of clusters and thus reveals the hidden structure of the data. Extensive experiments on synthetic and real data demonstrate that SCMiner outperforms state-of-the-art techniques for clustering and link prediction. Moreover, SCMiner discovers the hidden structure and reports it in an interpretable way to the user. Based on data compression, our technique does not rely on any input parameters which are difficult to estimate.

【Keywords】: bipartite graph; clustering; link prediction; summarization

157. Mining coherent subgraphs in multi-layer graphs with edge labels.

Paper Link】 【Pages】:1258-1266

【Authors】: Brigitte Boden ; Stephan Günnemann ; Holger Hoffmann ; Thomas Seidl

【Abstract】: Mining dense subgraphs such as cliques or quasi-cliques is an important graph mining problem and closely related to the notion of graph clustering. In various applications, graphs are enriched by additional information. For example, we can observe graphs representing different types of relations between the vertices. These multiple edge types can also be viewed as different "layers" of the same graph, which is denoted as a "multi-layer graph" in this work. Additionally, each edge might be annotated by a label characterizing the given relation in more detail. By exploiting all these different kinds of information, the detection of more interesting clusters in the graph can be supported. In this work, we introduce the multi-layer coherent subgraph (MLCS) model, which defines clusters of vertices that are densely connected by edges with similar labels in a subset of the graph layers. We avoid redundancy in the result by selecting only the most interesting, non-redundant clusters for the output. Based on this model, we introduce the best-first search algorithm MiMAG. In thorough experiments we demonstrate the strengths of MiMAG in comparison with related approaches on synthetic as well as real-world datasets.

【Keywords】: dense subgraphs; graph clustering; networks

Research session a3: recommendation 5

158. Circle-based recommendation in online social networks.

Paper Link】 【Pages】:1267-1275

【Authors】: Xiwang Yang ; Harald Steck ; Yong Liu

【Abstract】: Online social network information promises to increase recommendation accuracy beyond the capabilities of purely rating/feedback-driven recommender systems (RS). As to better serve users' activities across different domains, many online social networks now support a new feature of "Friends Circles", which refines the domain-oblivious "Friends" concept. RS should also benefit from domain-specific "Trust Circles". Intuitively, a user may trust different subsets of friends regarding different domains. Unfortunately, in most existing multi-category rating datasets, a user's social connections from all categories are mixed together. This paper presents an effort to develop circle-based RS. We focus on inferring category-specific social trust circles from available rating data combined with social network data. We outline several variants of weighting friends within circles based on their inferred expertise levels. Through experiments on publicly available data, we demonstrate that the proposed circle-based recommendation models can better utilize user's social trust information, resulting in increased recommendation accuracy.

【Keywords】: collaborative filtering; friends circles; online social networks; recommender systems

159. Incorporating heterogeneous information for personalized tag recommendation in social tagging systems.

Paper Link】 【Pages】:1276-1284

【Authors】: Wei Feng ; Jianyong Wang

【Abstract】: A social tagging system provides users an effective way to collaboratively annotate and organize items with their own tags. A social tagging system contains heterogeneous information like users' tagging behaviors, social networks, tag semantics and item profiles. All the heterogeneous information helps alleviate the cold start problem due to data sparsity. In this paper, we model a social tagging system as a multi-type graph. To learn the weights of different types of nodes and edges, we propose an optimization framework, called OptRank. OptRank can be characterized as follows:(1) Edges and nodes are represented by features. Different types of edges and nodes have different set of features. (2) OptRank learns the best feature weights by maximizing the average AUC (Area Under the ROC Curve) of the tag recommender. We conducted experiments on two publicly available datasets, i.e., Delicious and Last.fm. Experimental results show that: (1) OptRank outperforms the existing graph based methods when only (user, tag, item) relation is available. (2) OptRank successfully improves the results by incorporating social network, tag semantics and item profiles.

【Keywords】: recommender system; social tagging system

160. Cross-domain collaboration recommendation.

Paper Link】 【Pages】:1285-1293

【Authors】: Jie Tang ; Sen Wu ; Jimeng Sun ; Hang Su

【Abstract】: Interdisciplinary collaborations have generated huge impact to society. However, it is often hard for researchers to establish such cross-domain collaborations. What are the patterns of cross-domain collaborations? How do those collaborations form? Can we predict this type of collaborations? Cross-domain collaborations exhibit very different patterns compared to traditional collaborations in the same domain: 1) sparse connection: cross-domain collaborations are rare; 2) complementary expertise: cross-domain collaborators often have different expertise and interest; 3) topic skewness: cross-domain collaboration topics are focused on a subset of topics. All these patterns violate fundamental assumptions of traditional recommendation systems. In this paper, we analyze the cross-domain collaboration data from research publications and confirm the above patterns. We propose the Cross-domain Topic Learning (CTL) model to address these challenges. For handling sparse connections, CTL consolidates the existing cross-domain collaborations through topic layers instead of at author layers, which alleviates the sparseness issue. For handling complementary expertise, CTL models topic distributions from source and target domains separately, as well as the correlation across domains. For handling topic skewness, CTL only models relevant topics to the cross-domain collaboration. We compare CTL with several baseline approaches on large publication datasets from different domains. CTL outperforms baselines significantly on multiple recommendation metrics. Beyond accurate recommendation performance, CTL is also insensitive to parameter tuning as confirmed in the sensitivity analysis.

【Keywords】: collaboration recommendation; social influence; social network

161. RecMax: exploiting recommender systems for fun and profit.

Paper Link】 【Pages】:1294-1302

【Authors】: Amit Goyal ; Laks V. S. Lakshmanan

【Abstract】: In recent times, collaborative filtering based Recommender Systems (RS) have become extremely popular. While research in recommender systems has mostly focused on improving the accuracy of recommendations, in this paper, we look at the "flip" side of a RS. That is, instead of improving existing recommender algorithms, we ask whether we can use an existing operational RS to launch a targeted marketing campaign. To this end, we propose a novel problem called RecMax that aims to select a set of "seed" users for a marketing campaign for a new product, such that if they endorse the product by providing relatively high ratings, the number of other users to whom the product is recommended by the underlying RS algorithm is maximum. We motivate RecMax with real world applications. We show that seeding can make a substantial difference, if done carefully. We prove that RecMax is not only NP-hard to solve optimally, it is NP-hard to even approximate within any reasonable factor. Given this hardness, we explore several natural heuristics on 3 real world datasets - Movielens, Yahoo! Music and Jester Joke and report our findings. We show that even though RecMax is hard to approximate, simple natural heuristics may provide impressive gains, for targeted marketing using RS.

【Keywords】: collaborative filtering; maximization; recommender systems; seed set selection; targeted marketing

162. Learning personal + social latent factor model for social recommendation.

Paper Link】 【Pages】:1303-1311

【Authors】: Yelong Shen ; Ruoming Jin

【Abstract】: Social recommendation, which aims to systematically leverage the social relationships between users as well as their past behaviors for automatic recommendation, attract much attention recently. The belief is that users linked with each other in social networks tend to share certain common interests or have similar tastes (homophily principle); such similarity is expected to help improve the recommendation accuracy and quality. There have been a few studies on social recommendations; however, they almost completely ignored the heterogeneity and diversity of the social relationship. In this paper, we develop a joint personal and social latent factor (PSLF) model for social recommendation. Specifically, it combines the state-of-the-art collaborative filtering and the social network modeling approaches for social recommendation. Especially, the PSLF extracts the social factor vectors for each user based on the state-of-the-art mixture membership stochastic blockmodel, which can explicitly express the varieties of the social relationship. To optimize the PSLF model, we develop a scalable expectation-maximization (EM) algorithm, which utilizes a novel approximate mean-field technique for fast expectation computation. We compare our approach with the latest social recommendation approaches on two real datasets, Flixter and Douban (both with large social networks). With similar training cost, our approach has shown a significant improvement in terms of prediction accuracy criteria over the existing approaches.

【Keywords】: personal + social factor; social recommender system

Research session a4: clustering 5

163. Two approaches to understanding when constraints help clustering.

Paper Link】 【Pages】:1312-1320

【Authors】: Ian Davidson

【Abstract】: Most algorithm work in data mining focuses on designing algorithms to address a learning problem. Here we focus our attention on designing algorithms to determine the ease or difficulty of a problem instance. The area of clustering under constraints has recently received much attention in the data mining community. We can view the constraints as restricting (either directly or indirectly) the search space of a clustering algorithm to just feasible clusterings. However, to our knowledge no work explores methods to count the feasible clusterings or other measures of difficulty nor the importance of these measures. We present two approaches to efficiently characterize the difficulty of satisfying must-link (ML) and cannot-link (CL) constraints: calculating the fractional chromatic polynomial of the constraint graph using LP and approximately counting the number of feasible clusterings using MCMC samplers. We show that these measures are correlated to the classical performance measures of constrained clustering algorithms. From these insights and our algorithms we construct new methods of generating and pruning constraints and empirically demonstrate their usefulness.

【Keywords】: clustering; constraints; semi-supervised learning

164. Chromatic correlation clustering.

Paper Link】 【Pages】:1321-1329

【Authors】: Francesco Bonchi ; Aristides Gionis ; Francesco Gullo ; Antti Ukkonen

【Abstract】: We study a novel clustering problem in which the pairwise relations between objects are categorical. This problem can be viewed as clustering the vertices of a graph whose edges are of different types (colors). We introduce an objective function that aims at partitioning the graph such that the edges within each cluster have, as much as possible, the same color. We show that the problem is NP-hard and propose a randomized algorithm with approximation guarantee proportional to the maximum degree of the input graph. The algorithm iteratively picks a random edge as pivot, builds a cluster around it, and removes the cluster from the graph. Although being fast, easy-to-implement, and parameter free, this algorithm tends to produce a relatively large number of clusters. To overcome this issue we introduce a variant algorithm, which modifies how the pivot is chosen and and how the cluster is built around the pivot. Finally, to address the case where a fixed number of output clusters is required, we devise a third algorithm that directly optimizes the objective function via a strategy based on the alternating minimization paradigm. We test our algorithms on synthetic and real data from the domains of protein-interaction networks, social media, and bibliometrics. Experimental evidence show that our algorithms outperform a baseline algorithm both in the task of reconstructing a ground-truth clustering and in terms of objective function value.

【Keywords】: clustering; edge-labeled graphs

165. Locally-scaled spectral clustering using empty region graphs.

Paper Link】 【Pages】:1330-1338

【Authors】: Carlos D. Correa ; Peter Lindstrom

【Abstract】: This paper introduces a new method for estimating the local neighborhood and scale of data points to improve the robustness of spectral clustering algorithms. We employ a subset of empty region graphs - the β-skeleton - and non-linear diffusion to define a locally-adapted affinity matrix, which, as we demonstrate, provides higher quality clustering than conventional approaches based on κ nearest neighbors or global scale parameters. Moreover, we show that the clustering quality is far less sensitive to the choice of β and other algorithm parameters, and to transformations such as geometric distortion and random perturbation. We summarize the results of an empirical study that applies our method to a number of 2D synthetic data sets, consisting of clusters of arbitrary shape and scale, and to real multi-dimensional classification examples from benchmarks, including image segmentation.

【Keywords】: proximity graphs; spectral clustering

166. Active spectral clustering via iterative uncertainty reduction.

Paper Link】 【Pages】:1339-1347

【Authors】: Fabian L. Wauthier ; Nebojsa Jojic ; Michael I. Jordan

【Abstract】: Spectral clustering is a widely used method for organizing data that only relies on pairwise similarity measurements. This makes its application to non-vectorial data straight-forward in principle, as long as all pairwise similarities are available. However, in recent years, numerous examples have emerged in which the cost of assessing similarities is substantial or prohibitive. We propose an active learning algorithm for spectral clustering that incrementally measures only those similarities that are most likely to remove uncertainty in an intermediate clustering solution. In many applications, similarities are not only costly to compute, but also noisy. We extend our algorithm to maintain running estimates of the true similarities, as well as estimates of their accuracy. Using this information, the algorithm updates only those estimates which are relatively inaccurate and whose update would most likely remove clustering uncertainty. We compare our methods on several datasets, including a realistic example where similarities are expensive and noisy. The results show a significant improvement in performance compared to the alternatives.

【Keywords】: active learning; spectral clustering

167. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks.

Paper Link】 【Pages】:1348-1356

【Authors】: Yizhou Sun ; Brandon Norick ; Jiawei Han ; Xifeng Yan ; Philip S. Yu ; Xiao Yu

【Abstract】: Real-world, multiple-typed objects are often interconnected, forming heterogeneous information networks. A major challenge for link-based clustering in such networks is its potential to generate many different results, carrying rather diverse semantic meanings. In order to generate desired clustering, we propose to use meta-path, a path that connects object types via a sequence of relations, to control clustering with distinct semantics. Nevertheless, it is easier for a user to provide a few examples ("seeds") than a weighted combination of sophisticated meta-paths to specify her clustering preference. Thus, we propose to integrate meta-path selection with user-guided clustering to cluster objects in networks, where a user first provides a small set of object seeds for each cluster as guidance. Then the system learns the weights for each meta-path that are consistent with the clustering result implied by the guidance, and generates clusters under the learned weights of meta-paths. A probabilistic approach is proposed to solve the problem, and an effective and efficient iterative algorithm, PathSelClus, is proposed to learn the model, where the clustering quality and the meta-path weights are mutually enhancing each other. Our experiments with several clustering tasks in two real networks demonstrate the power of the algorithm in comparison with the baselines.

【Keywords】: heterogeneous information networks; meta-path selection; user guided clustering

Industry/govt track a5: intelligent systems 5

168. Design principles of massive, robust prediction systems.

Paper Link】 【Pages】:1357-1365

【Authors】: Troy Raeder ; Ori Stitelman ; Brian Dalessandro ; Claudia Perlich ; Foster J. Provost

【Abstract】: Most data mining research is concerned with building high-quality classification models in isolation. In massive production systems, however, the ability to monitor and maintain performance over time while growing in size and scope is equally important. Many external factors may degrade classification performance including changes in data distribution, noise or bias in the source data, and the evolution of the system itself. A well-functioning system must gracefully handle all of these. This paper lays out a set of design principles for large-scale autonomous data mining systems and then demonstrates our application of these principles within the m6d automated ad targeting system. We demonstrate a comprehensive set of quality control processes that allow us monitor and maintain thousands of distinct classification models automatically, and to add new models, take on new data, and correct poorly-performing models without manual intervention or system disruption.

【Keywords】: data mining systems; data monitoring; quality control

169. PatentMiner: topic-driven patent analysis and mining.

Paper Link】 【Pages】:1366-1374

【Authors】: Jie Tang ; Bo Wang ; Yang Yang ; Po Hu ; Yanting Zhao ; Xinyu Yan ; Bo Gao ; Minlie Huang ; Peng Xu ; Weichang Li ; Adam K. Usadi

【Abstract】: Patenting is one of the most important ways to protect company's core business concepts and proprietary technologies. Analyzing large volume of patent data can uncover the potential competitive or collaborative relations among companies in certain areas, which can provide valuable information to develop strategies for intellectual property (IP), R&D, and marketing. In this paper, we present a novel topic-driven patent analysis and mining system. Instead of merely searching over patent content, we focus on studying the heterogeneous patent network derived from the patent database, which is represented by several types of objects (companies, inventors, and technical content) jointly evolving over time. We design and implement a general topic-driven framework for analyzing and mining the heterogeneous patent network. Specifically, we propose a dynamic probabilistic model to characterize the topical evolution of these objects within the patent network. Based on this modeling framework, we derive several patent analytics tools that can be directly used for IP and R&D strategy planning, including a heterogeneous network co-ranking method, a topic-level competitor evolution analysis algorithm, and a method to summarize the search results. We evaluate the proposed methods on a real-world patent database. The experimental results show that the proposed techniques clearly outperform the corresponding baseline methods.

【Keywords】: company ranking; competitor analysis; patent analysis; social network

170. Storytelling in entity networks to support intelligence analysts.

Paper Link】 【Pages】:1375-1383

【Authors】: M. Shahriar Hossain ; Patrick Butler ; Arnold P. Boedihardjo ; Naren Ramakrishnan

【Abstract】: Intelligence analysts grapple with many challenges, chief among them is the need for software support in storytelling, i.e., automatically 'connecting the dots' between disparate entities (e.g., people, organizations) in an effort to form hypotheses and suggest non-obvious relationships. We present a system to automatically construct stories in entity networks that can help form directed chains of relationships, with support for co-referencing, evidence marshaling, and imposing syntactic constraints on the story generation process. A novel optimization technique based on concept lattice mining enables us to rapidly construct stories on massive datasets. Using several public domain datasets, we illustrate how our approach overcomes many limitations of current systems and enables the analyst to efficiently narrow down to hypotheses of interest and reason about alternative explanations.

【Keywords】: connecting the dots; intelligence analysis; redescriptions; storytelling

171. A framework for robust discovery of entity synonyms.

Paper Link】 【Pages】:1384-1392

【Authors】: Kaushik Chakrabarti ; Surajit Chaudhuri ; Tao Cheng ; Dong Xin

【Abstract】: Entity synonyms are critical for many applications like information retrieval and named entity recognition in documents. The current trend is to automatically discover entity synonyms using statistical techniques on web data. Prior techniques suffer from several limitations like click log sparsity and inability to distinguish between entities of different concept classes. In this paper, we propose a general framework for robustly discovering entity synonym with two novel similarity functions that overcome the limitations of prior techniques. We develop efficient and scalable techniques leveraging the MapReduce framework to discover synonyms at large scale. To handle long entity names with extraneous tokens, we propose techniques to effectively map long entity names to short queries in query log. Our experiments on real data from different entity domains demonstrate the superior quality of our synonyms as well as the efficiency of our algorithms. The entity synonyms produced by our system is in production in Bing Shopping and Video search, with experiments showing the significance it brings in improving search experience.

【Keywords】: entity synonym; pseudo document similarity; query context similarity; robust synonym discovery

172. SmartDispatch: enabling efficient ticket dispatch in an IT service environment.

Paper Link】 【Pages】:1393-1401

【Authors】: Shivali Agarwal ; Renuka Sindhgatta ; Bikram Sengupta

【Abstract】: In an IT service delivery environment, the speedy dispatch of a ticket to the correct resolution group is the crucial first step in the problem resolution process. The size and complexity of such environments make the dispatch decision challenging, and incorrect routing by a human dispatcher can lead to significant delays that degrade customer satisfaction, and also have adverse financial implications for both the customer and the IT vendor. In this paper, we present SmartDispatch, a learning-based tool that seeks to automate the process of ticket dispatch while maintaining high accuracy levels. SmartDispatch comes with two classification approaches - the well-known SVM method, and a discriminative term-based approach that we designed to address some of the issues in SVM classification that were empirically observed. Using a combination of these approaches, SmartDispatch is able to automate the dispatch of a ticket to the correct resolution group for a large share of the tickets, while for the rest, it is able to suggest a short list of 3-5 groups that contain the correct resolution group with a high probability. Empirical evaluation of SmartDispatch on data from 3 large service engagement projects in IBM demonstrate the efficacy and practical utility of the approach.

【Keywords】: SVM classification; automated and advisory mode dispatch; discriminative term weighting; ticket resolution group

Asia-Pacific track a6: session 3 4

173. Social media data analysis for revealing collective behaviors.

Paper Link】 【Pages】:1402

【Authors】: Aoying Zhou ; Weining Qian ; Haixin Ma

【Abstract】: Along with the development of Web 2.0 applications, social media services has attracted many users and become their hands-on toolkits for recording life, sharing ideas, and social networking. Though social media services are essentially web or mobile applications and services, they combine user-generated content and social networks together, so that information can be created, transmitted, transformed, and consumed in the cyberspace. Thus, social media somehow could be regarded as a kind of sensor to the real life of its users. In general, the data from social media is of low quality. Pieces of information in social media are usually short, with informal presentation, and in some specific context that is highly related to the physical world. Therefore, it is challenging to extract semantics from social media data. However, we argue that given sufficient social media data, users' collective behaviors could be sensed, studied, and even predicted in a certain circumstance. Our study is conducted on data from two services, i.e. Twitter, and Sina Weibo, the most popular microblogging services all over the world and in China, respectively. Collective behaviors are actions of a large amount of various people, which are neither conforming nor deviant. Various collective behaviors are studied in the context of social media. Our studies show that there are various information flow patterns in social media, some of which are similar to traditional media such as newspapers, while others are embedded deep in the social network structure. The evolution of hotspots is highly affected by external stimulation, the social network structure, and individual user's activities. Furthermore, social media tends to be immune to some repeated similar external stimulations. Last but not the least, there is considerable difference in users' behavior between Twitter and Sina Weibo.

【Keywords】: collective behavior; data analysis; social media

174. Information processing in social networks.

Paper Link】 【Pages】:1403

【Authors】: Ming-Syan Chen

【Abstract】: In the current social network, a user may have hundreds of friends and find it very time consuming to categorize and tag every friend manually. When a user is going to initiate an activity by issuing a corresponding query, he/she needs to consider the relationship among candidate attendees to find a group of mutually close friends. Meanwhile, he/she also needs to consider the schedule of candidate attendees to find an activity period available for all attendees. It would certainly be desirable if the efficiency of such process is improved. In this talk, information processing in social networks will first be reviewed in three phrases, namely (i) from content to social relationship, (ii) mining on social relationship, and (iii) from social relationship to content organization. In addition, we shall present an effective procedure which helps a user to organize an event with proper attendees with minimum total social distance and commonly available time. Moreover, it is noted that the information retrieved from the social networks is also able to facilitate those user-dependent and human-centric services. In light of this, we shall explore the quality of recommendation through incorporating the notion of social filtering and collaborative filtering. Finally, it is recognized that the cloud computing has offered many new capabilities of storing and processing huge amounts of heterogeneous data in social networks. In view of this, we shall also examine how this paradigm shift will affect the information processing in social networks.

【Keywords】: recommendation systems; social network

Paper Link】 【Pages】:1404

【Authors】: Gordon Sun

【Abstract】: To fulfill users' search needs, the search engine must have good performance, easy-to-use functionalities, and good search result quality. Search quality evaluation becomes challenging when users' satisfaction may not be able to judge by a single search and even within a single search judgments from various sources are not consistent. In this talk, I will discuss how user's satisfaction is decomposed into different components in general, and how we measure them with various means - human judgment, automatic computation with query log, and outsourcing, and their pros and cons with operational implications. For an outlook, I will postulate potential evaluation approaches for a better user's satisfaction.

【Keywords】: search engine performance

Paper Link】 【Pages】:1405

【Authors】: Cuiping Li

【Abstract】: Recently there has been a lot of interest in graph-based analysis. One of the most important aspects of graph-based analysis is to measure similarity between nodes and to do similarity search in a graph. For example, in social networks such as Facebook, system may want to recommend potential friends to a particular user based on connections between users. In custom-product networks such as eBay, one may wish to recommend products to others based on purchases history. In this talk, I will introduce some methods on vertex similarities computations and their applications on similarity search in real world networks.

【Keywords】: graph mining; similarity query; social network

Research session b1: keywords and documents 4

177. Large-scale learning of word relatedness with constraints.

Paper Link】 【Pages】:1406-1414

【Authors】: Guy Halawi ; Gideon Dror ; Evgeniy Gabrilovich ; Yehuda Koren

【Abstract】: Prior work on computing semantic relatedness of words focused on representing their meaning in isolation, effectively disregarding inter-word affinities. We propose a large-scale data mining approach to learning word-word relatedness, where known pairs of related words impose constraints on the learning process. We learn for each word a low-dimensional representation, which strives to maximize the likelihood of a word given the contexts in which it appears. Our method, called CLEAR, is shown to significantly outperform previously published approaches. The proposed method is based on first principles, and is generic enough to exploit diverse types of text corpora, while having the flexibility to impose constraints on the derived word similarities. We also make publicly available a new labeled dataset for evaluating word relatedness algorithms, which we believe to be the largest such dataset to date.

【Keywords】: semantic similarity; word relatedness

178. Latent association analysis of document pairs.

Paper Link】 【Pages】:1415-1423

【Authors】: Gengxin Miao ; Ziyu Guan ; Louise E. Moser ; Xifeng Yan ; Shu Tao ; Nikos Anerousis ; Jimeng Sun

【Abstract】: This paper presents Latent Association Analysis (LAA), a generative model that analyzes the topics within two document sets simultaneously, as well as the correlations between the two topic structures, by considering the semantic associations among document pairs. LAA defines a correlation factor that represents the connection between two documents, and considers the topic proportion of paired documents based on this factor. Words in the documents are assumed to be randomly generated by particular topic assignments and topic-to-word probability distributions. The paper also presents a new ranking algorithm, based on LAA, that can be used to retrieve target documents that are potentially associated with a given source document. The ranking algorithm uses the latent factor in LAA to rank target documents by the strength of their semantic associations with the source document. We evaluate the LAA algorithm with real datasets, specifically, the IT-Change and the IT-Solution document sets from the IBM IT service environment and the Symptom-Treatment document sets from Google Health. Experimental results demonstrate that the LAA algorithm significantly outperforms existing algorithms.

【Keywords】: ranking algorithm; topic model; variational inference

Paper Link】 【Pages】:1424-1432

【Authors】: Wei Shen ; Jianyong Wang ; Ping Luo ; Min Wang

【Abstract】: A critical step in bridging the knowledge base with the huge corpus of semi-structured Web list data is to link the entity mentions that appear in the Web lists with the corresponding real world entities in the knowledge base, which we call list linking task. This task can facilitate many different tasks such as knowledge base population, entity search and table annotation. However, the list linking task is challenging because a Web list has almost no textual context, and the only input for this task is a list of entity mentions extracted from the Web pages. In this paper, we propose LIEGE, the first general framework to Link the entities in web lists with the knowledge base to the best of our knowledge. Our assumption is that entities mentioned in a Web list can be any collection of entities that have the same conceptual type that people have in mind. To annotate the list items in a Web list with entities that they likely mention, we leverage the prior probability of an entity being mentioned and the global coherence between the types of entities in the Web list. The interdependence between different entity assignments in a Web list makes the optimization of this list linking problem NP-hard. Accordingly, we propose a practical solution based on the iterative substitution to jointly optimize the identification of the mapping entities for the Web list items. We extensively evaluated the performance of our proposed framework over both manually annotated real Web lists extracted from the Web pages and two public data sets, and the experimental results show that our framework significantly outperforms the baseline method in terms of accuracy.

【Keywords】: iterative substitution; knowledge base; list linking; similarity metric

180. Automatic taxonomy construction from keywords.

Paper Link】 【Pages】:1433-1441

【Authors】: Xueqing Liu ; Yangqiu Song ; Shixia Liu ; Haixun Wang

【Abstract】: Taxonomies, especially the ones in specific domains, are becoming indispensable to a growing number of applications. State-of-the-art approaches assume there exists a text corpus to accurately characterize the domain of interest, and that a taxonomy can be derived from the text corpus using information extraction techniques. In reality, neither assumption is valid, especially for highly focused or fast-changing domains. In this paper, we study a challenging problem: Deriving a taxonomy from a set of keyword phrases. A solution can benefit many real life applications because i) keywords give users the flexibility and ease to characterize a specific domain; and ii) in many applications, such as online advertisements, the domain of interest is already represented by a set of keywords. However, it is impossible to create a taxonomy out of a keyword set itself. We argue that additional knowledge and contexts are needed. To this end, we first use a general purpose knowledgebase and keyword search to supply the required knowledge and context. Then we develop a Bayesian approach to build a hierarchical taxonomy for a given set of keywords. We reduce the complexity of previous hierarchical clustering approaches from O(n2 log n) to O(n log n), so that we can derive a domain specific taxonomy from one million keyword phrases in less than an hour. Finally, we conduct comprehensive large scale experiments to show the effectiveness and efficiency of our approach. A real life example of building an insurance-related query taxonomy illustrates the usefulness of our approach for specific domains.

【Keywords】: bayesian rose tree; hierarchical clustering; knowledgebase; query understanding; taxonomy building

Research session b2: patterns 3

181. An enhanced relevance criterion for more concise supervised pattern discovery.

Paper Link】 【Pages】:1442-1450

【Authors】: Henrik Großkreutz ; Daniel Paurat ; Stefan Rüping

【Abstract】: Supervised local pattern discovery aims to find subsets of a database with a high statistical unusualness in the distribution of a target attribute. Local pattern discovery is often used to generate a human-understandable representation of the most interesting dependencies in a data set. Hence, the more crisp and concise the output is, the better. Unfortunately, standard algorithm often produce very large and redundant outputs. In this paper, we introduce delta-relevance, a definition of a more strict criterion of relevance. It will allow us to significantly reduce the output space, while being able to guarantee that every local pattern has a delta-relevant representative which is almost as good in a clearly defined sense. We show empirically that delta-relevance leads to a considerable reduction of the amount of returned patterns. We also demonstrate that in a top-k setting, the removal of not delta-relevant patterns improves the quality of the result set.

【Keywords】: local patterns; subgroup discovery; theory of relevance

182. Efficient frequent item counting in multi-core hardware.

Paper Link】 【Pages】:1451-1459

【Authors】: Pratanu Roy ; Jens Teubner ; Gustavo Alonso

【Abstract】: The increasing number of cores and the rich instruction sets of modern hardware are opening up new opportunities for optimizing many traditional data mining tasks. In this paper we demonstrate how to speed up the performance of the computation of frequent items by almost one order of magnitude over the best published results by matching the algorithm to the underlying hardware architecture. We start with the observation that frequent item counting, like other data mining tasks, assumes certain amount of skew in the data. We exploit this skew to design a new algorithm that uses a pre-filtering stage that can be implemented in a highly efficient manner through SIMD instructions. Using pipelining, we then combine this pre-filtering stage with a conventional frequent item algorithm (Space-Saving) that will process the remainder of the data. The resulting operator can be parallelized with a small number of cores, leading to a parallel implementation that does not suffer any of the overheads of existing parallel solutions when querying the results and offers significantly higher throughput.

【Keywords】: data flow; frequent items; parallelism

183. On nested palindromes in clickstream data.

Paper Link】 【Pages】:1460-1468

【Authors】: Michel Speiser ; Gianluca Antonini ; Abderrahim Labbi ; Juliana Sutanto

【Abstract】: In this paper we discuss an interesting and useful property of clickstream data. Often a visit includes repeated views of the same page. We show that in three real datasets, sampled from the websites of technology and consulting groups and a news broadcaster, page repetitions occur for the majority as a very specific structure, namely in the form of nested palindromes. This can be explained by the widespread use of features which are available in any web browser: the "refresh" and "back" buttons. Among the types of patterns which can be mined from sequence data, many either stumble if symbol repetitions are involved, or else fail to capture interesting aspects related to symbol repetitions. In an attempt to remedy this, we characterize the palindromic structures, and discuss possible ways of making use of them. One way is to pre-process the sequence data by explicitly inserting these structures, in order to obtain a richer output from conventional mining algorithms. Another application we discuss is to use the information directly, in order to analyze certain aspects of the website under study. We also provide the simple linear-time algorithm which we developed to identify and extract the structures from our data.

【Keywords】: backtrack; palindrome; refresh; web usage mining

Research session b3: spatial and pattern recognition 3

184. Mining discriminative components with low-rank and sparsity constraints for face recognition.

Paper Link】 【Pages】:1469-1477

【Authors】: Qiang Zhang ; Baoxin Li

【Abstract】: This paper introduces a novel image decomposition approach for an ensemble of correlated images, using low-rank and sparsity constraints. Each image is decomposed as a combination of three components: one common component, one condition component, which is assumed to be a low-rank matrix, and a sparse residual. For a set of face images of Nsubjects, the decomposition finds N common components, one for each subject, K low-rank components, each capturing a different global condition of the set (e.g., different illumination conditions), and a sparse residual for each input image. Through this decomposition, the proposed approach recovers a clean face image (the common component) for each subject and discovers the conditions (the condition components and the sparse residuals) of the images in the set. The set of N+K images containing only the common and the low-rank components form a compact and discriminative representation for the original images. We design a classifier using only these N+K images. Experiments on commonly-used face data sets demonstrate the effectiveness of the approach for face recognition through comparing with the leading state-of-the-art in the literature. The experiments further show good accuracy in classifying the condition of an input image, suggesting that the components from the proposed decomposition indeed capture physically meaningful features of the input.

【Keywords】: component decomposition; face recognition; low-rank matrix; sparse matrix; subspace learning

185. Fast algorithms for comprehensive n-point correlation estimates.

Paper Link】 【Pages】:1478-1486

【Authors】: William B. March ; Andrew J. Connolly ; Alexander G. Gray

【Abstract】: The n-point correlation functions (npcf) are powerful spatial statistics capable of fully characterizing any set of multidimensional points. These functions are critical in key data analyses in astronomy and materials science, among other fields, for example to test whether two point sets come from the same distribution and to validate physical models and theories. For example, the npcf has been used to study the phenomenon of dark energy, considered one of the major breakthroughs in recent scientific discoveries. Unfortunately, directly estimating the continuous npcf at a single value requires O(Nn) time for $N$ points, and n may be 2, 3, 4 or even higher, depending on the sensitivity required. In order to draw useful conclusions about real scientific problems, we must repeat this expensive computation both for many different scales in order to derive a smooth estimate and over many different subsamples of our data in order to bound the variance. We present the first comprehensive approach to the entire n-point correlation function estimation problem, including fast algorithms for the computation at multiple scales and for many subsamples. We extend the current state-of-the-art tree-based approach with these two algorithms. We show an order-of-magnitude speedup over the current best approach with each of our new algorithms and show that they can be used together to obtain over 500x speedups over the state-of-the-art in order to enable much larger datasets and more accurate scientific analyses than were possible previously.

【Keywords】: jackknife resampling; n-point correlation functions

186. On "one of the few" objects.

Paper Link】 【Pages】:1487-1495

【Authors】: You Wu ; Pankaj K. Agarwal ; Chengkai Li ; Jun Yang ; Cong Yu

【Abstract】: Objects with multiple numeric attributes can be compared within any "subspace" (subset of attributes). In applications such as computational journalism, users are interested in claims of the form: Karl Malone is one of the only two players in NBA history with at least 25,000 points, 12,000 rebounds, and 5,000 assists in one's career. One challenge in identifying such "one-of-the-k" claims (k = 2 above) is ensuring their "interestingness". A small k is not a good indicator for interestingness, as one can often make such claims for many objects by increasing the dimensionality of the subspace considered. We propose a uniqueness-based interestingness measure for one-of-the-few claims that is intuitive for non-technical users, and we design algorithms for finding all interesting claims (across all subspaces) from a dataset. Sometimes, users are interested primarily in the objects appearing in these claims. Building on our notion of interesting claims, we propose a scheme for ranking objects and an algorithm for computing the top-ranked objects. Using real-world datasets, we evaluate the efficiency of our algorithms as well as the advantage of our object-ranking scheme over popular methods such as Kemeny optimal rank aggregation and weighted-sum ranking.

【Keywords】: computational journalism; fact checking; ranking; skyband

Demonstration track 20

187. BC-PDM: data mining, social network analysis and text mining system based on cloud computing.

Paper Link】 【Pages】:1496-1499

【Authors】: Le Yu ; Jian Zheng ; Wei Chong Shen ; Bin Wu ; Bai Wang ; Long Qian ; Bo Ren Zhang

【Abstract】: Telecom BI(Business Intelligence) system consists of a set of application programs and technologies for gathering, storing, analyzing and providing access to data, which contribute to manage business information and make decision precisely. However, traditional analysis algorithms meet new challenges as the continued exponential growth in both the volume and the complexity of telecom data. With the Cloud Computing development, some parallel data analysis systems have been emerging. However, existing systems have rarely comprehensive function, either providing data analysis service or providing social network analysis. We need a comprehensive tool to store and analysis large scale data efficiently. In response to the challenge, the SaaS (Software-as-a-Service) BI system, BC-PDM (Big Cloud-Parallel Data Mining), are proposed. BC-PDM supports parallel ETL process, statistical analysis, data mining, text mining and social network analysis which are based on Hadoop. This demo introduces three tasks: business recommendation, customer community detection and user preference classification by employing a real telecom data set. Experimental results show BC-PDM is very efficient and effective for intelligence data analysis.

【Keywords】: SNA; hadoop; parallel data mining; text mining

188. Query-driven discovery of semantically similar substructures in heterogeneous networks.

Paper Link】 【Pages】:1500-1503

【Authors】: Xiao Yu ; Yizhou Sun ; Peixiang Zhao ; Jiawei Han

【Abstract】: Heterogeneous information networks that contain multiple types of objects and links are ubiquitous in the real world, such as bibliographic networks, cyber-physical networks, and social media networks. Although researchers have studied various data mining tasks in information networks, interactive query-based network exploration techniques have not been addressed systematically, which, in fact, are highly desirable for exploring large-scale information networks. In this demo, we introduce and demonstrate our recent research project on query-driven discovery of semantically similar substructures in heterogeneous networks. Given a subgraph query, our system searches a given large information network and finds efficiently a list of subgraphs that are structurally identical and semantically similar. Since data mining methods are used to obtain semantically similar entities (nodes), we use discovery as a term to describe this process. In order to achieve high efficiency and scalability, we design and implement a filter-and verification search framework, which can first generate promising subgraph candidates using off line indices built by data mining results, and then verify candidates with a recursive pruning matching process. The proposed system demonstrates the effectiveness of our query-driven semantic similarity search framework and the efficiency of the proposed methodology on multiple real-world heterogeneous information networks.

【Keywords】: data mining; graph query; information network

189. DAGger: clustering correlated uncertain data (to predict asset failure in energy networks).

Paper Link】 【Pages】:1504-1507

【Authors】: Dan Olteanu ; Sebastiaan J. van Schaik

【Abstract】: DAGger is a clustering algorithm for uncertain data. In contrast to prior work, DAGger can work on arbitrarily correlated data and can compute both exact and approximate clusterings with error guarantees. We demonstrate DAGger using a real-world scenario in which partial discharge data from UK Power Networks is clustered to predict asset failure in the energy network.

【Keywords】: classification; clustering; correlations; dagger; partial discharge; probabilistic data; uncertain data

190. UFIMT: an uncertain frequent itemset mining toolbox.

Paper Link】 【Pages】:1508-1511

【Authors】: Yongxin Tong ; Lei Chen ; Philip S. Yu

【Abstract】: In recent years, mining frequent itemsets over uncertain data has attracted much attention in the data mining community. Unlike the corresponding problem in deterministic data, the frequent itemset under uncertain data has two different definitions: the expected support-based frequent itemset and the probabilistic frequent itemset. Most existing works only focus on one of the definitions and no comprehensive study is conducted to compare the two different definitions. Moreover, due to lacking the uniform implementation platform, existing solutions for the same definition even generate inconsistent results. In this demo, we present a demonstration called as UFIMT (underline Uncertain Frequent Itemset Mining Toolbox) which not only discovers frequent itemsets over uncertain data but also compares the performance of different algorithms and demonstrates the relationship between different definitions. In this demo, we firstly present important techniques and implementation skills of the mining problem, secondly, we show the system architecture of UFIMT, thirdly, we report an empirical analysis on extensive both real and synthetic benchmark data sets, which are used to compare different algorithms and to show the close relationship between two different frequent itemset definitions, and finally we discuss some existing challenges and new findings.

【Keywords】: UFIMT; frequent itemset mining; uncertain database

191. Visual exploration of collaboration networks based on graph degeneracy.

Paper Link】 【Pages】:1512-1515

【Authors】: Christos Giatsidis ; Klaus Berberich ; Dimitrios M. Thilikos ; Michalis Vazirgiannis

【Abstract】: We demonstrate a system that supports the visual exploration of collaboration networks. The system leverages the notion of fractional cores introduced in earlier work to rank vertices in a collaboration network and filter vertices' neighborhoods. Fractional cores build on the idea of graph degeneracy as captured by the notion of k-cores in graph theory and extend it to undirected edge-weighted graphs. In a co-authorship network, for instance, the fractional core index of an author intuitively reflects the degree of collaboration with equally or higher-ranked authors. Our system has been deployed on a real-world co-authorship network derived from DBLP, demonstrating that the idea of fractional cores can be applied even to large-scale networks. The system provides an easy-to-use interface to query for the fractional core index of an author, to see who the closest equally or higher-ranked co-authors are, and explore the entire co-authorship network in an incremental manner.

【Keywords】: collaboration networks; fractional cores; graph degeneracy

192. TourViz: interactive visualization of connection pathways in large graphs.

Paper Link】 【Pages】:1516-1519

【Authors】: Duen Horng Chau ; Leman Akoglu ; Jilles Vreeken ; Hanghang Tong ; Christos Faloutsos

【Abstract】: We present TourViz, a system that helps its users to interactively visualize and make sense in large network datasets. In particular, it takes as input a set of nodes the user specifies as of interest and presents the user with a visualization of connection subgraphs around these input nodes. Each connection subgraph contains good pathways that highlight succinct connections among a "close-by" group of input nodes. TourViz combines visualization with rich user interaction to engage and help the user to further understand the relations among the nodes of interest,by exploring their neighborhood on demand as well as modifying the set of interest nodes. We demonstrate TourViz's usage and benefits using the DBLP graph, consisting of authors and their co-authorship relations, while our system is designed generally to work with any kind of graph data. We will invite the audience to experiment with our system and comment on its usability, usefulness, and how our system can help with their research and improve the understanding of data in other domains.

【Keywords】: connection subgraphs; large networks; sensemaking

193. D-INDEX: a web environment for analyzing dependences among scientific collaborators.

Paper Link】 【Pages】:1520-1523

【Authors】: Claudio Schifanella ; Luigi Di Caro ; Mario Cataldi ; Marie-Aude Aufaure

【Abstract】: In this work, we demonstrate a web application, available at http://d-index.di.unito.it, that permits to analyze the scientific profiles of all the researchers indexed by DBLP by focusing on the collaborations that contributed to define their curricula. The presented application allows the user to analyze the profile of a researcher, her dependence degrees on all the co-authors (along her entire scientific publication history) and to make comparisons among them in terms of dependence patterns. In particular, it is possible to estimate and visualize how much a researcher has benefited from collaboration with another researcher as well as the communities in which she has been involved. Moreover, the application permits to compare, in a single chart, each researcher with all the scientists indexed in DBLP by focusing on their dependences with respect to many other parameters like the total number of papers, the number of collaborations and the length of the scientific careers.

【Keywords】: DBLP; collaboration graph; scientometrics

194. Information propagation game: a tool to acquire humanplaying data for multiplayer influence maximization on social networks.

Paper Link】 【Pages】:1524-1527

【Authors】: Hung-Hsuan Chen ; Yan-Bin Ciou ; Shou-De Lin

【Abstract】: With the popularity of online social network services, influence maximization on social networks has drawn much attention in recent years. Most of these studies approximate a greedy based sub-optimal solution by proving the submodular nature of the utility function. Instead of using the analytical techniques, we are interested in solving the diffusion competition and influence maximization problem by a data-driven approach. We propose Information Propagation Game (IPG), a framework that can collect a large number of seed picking strategies for analysis. Through the IPG framework, human players are not only having fun but also helping contributing the seed picking strategies. Preliminary experiment suggests that centrality based heuristics are too simple for seed selection in a multiple player environment.

【Keywords】: diffusion network; game; independent cascade; influence maximization; information propagation; linear threshold

195. MoodLens: an emoticon-based sentiment analysis system for chinese tweets.

Paper Link】 【Pages】:1528-1531

【Authors】: Jichang Zhao ; Li Dong ; Junjie Wu ; Ke Xu

【Abstract】: Recent years have witnessed the explosive growth of online social media. Weibo, a Twitter-like online social network in China, has attracted more than 300 million users in less than three years, with more than 1000 tweets generated in every second. These tweets not only convey the factual information, but also reflect the emotional states of the authors, which are very important for understanding user behaviors. However, a tweet in Weibo is extremely short and the words it contains evolve extraordinarily fast. Moreover, the Chinese corpus of sentiments is still very small, which prevents the conventional keyword-based methods from being used. In light of this, we build a system called MoodLens, which to our best knowledge is the first system for sentiment analysis of Chinese tweets in Weibo. In MoodLens, 95 emoticons are mapped into four categories of sentiments, i.e. angry, disgusting, joyful, and sad, which serve as the class labels of tweets. We then collect over 3.5 million labeled tweets as the corpus and train a fast Naive Bayes classifier, with an empirical precision of 64.3%. MoodLens also implements an incremental learning method to tackle the problem of the sentiment shift and the generation of new words. Using MoodLens for real-time tweets obtained from Weibo, several interesting temporal and spatial patterns are observed. Also, sentiment variations are well captured by MoodLens to effectively detect abnormal events in China. Finally, by using the highly efficient Naive Bayes classifier, MoodLens is capable of online real-time sentiment monitoring. The demo of MoodLens can be found at http://goo.gl/8DQ65.

【Keywords】: chinese short text; online social media; sentiment analysis; weibo

196. Intelligent advertising framework for digital signage.

Paper Link】 【Pages】:1532-1535

【Authors】: Phil Tian ; Addicam V. Sanjay ; Kunapareddy Chiranjeevi ; Shahzad Malik Malik

【Abstract】: How to realize targeted advertising in digital signage is an interesting question. This paper proposed an Intelligent Advertising Framework (IAF), which pioneers the integration of Anonymous Viewer Analytics (AVA) and Data Mining technologies to achieve Targeted and interactive Advertising. IAF correlates AVA viewership information with point-of-sale (POS) data, and establishes a link between the response time to an ad by a certain demographic group and the effect on the sale of the advertised product. With the advertising models learned based on this correlation, IAF can provide advertisers and retailers with intelligence to show the right ads to right audience in right location at right time. Preliminary results indicate that IAF will greatly improve the effect and utility of advertising and maximize the Return on Investment (ROI) of advertisers and retailers. The demo shows Intel's leadership regarding intelligent advertising in the Digital Signage industry.

【Keywords】: anonymous viewer analytics; data mining; digital signage; intelligent advertising framework; targeted advertising

197. AssocExplorer: an association rule visualization system for exploratory data analysis.

Paper Link】 【Pages】:1536-1539

【Authors】: Guimei Liu ; Andre Suchitra ; Haojun Zhang ; Mengling Feng ; See-Kiong Ng ; Limsoon Wong

【Abstract】: We present a system called AssocExplorer to support exploratory data analysis via association rule visualization and exploration. AssocExplorer is designed by following the visual information-seeking mantra: overview first, zoom and filter, then details on demand. It effectively uses coloring to deliver information so that users can easily detect things that are interesting to them. If users find a rule interesting, they can explore related rules for further analysis, which allows users to find interesting phenomenon that are difficult to detect when rules are examined separately. Our system also allows users to compare rules and inspect rules with similar item composition but different statistics so that the key factors that contribute to the difference can be isolated.

【Keywords】: exploratory data analysis; rule exploration; rule visualization

198. GeoSearch: georeferenced video retrieval system.

Paper Link】 【Pages】:1540-1543

【Authors】: Youngwoo Kim ; Jinha Kim ; Hwanjo Yu

【Abstract】: Conventional video search systems, to find relevant videos, rely on textual data such as video titles, annotations, and text around the video. Nowadays, video recording devices such as ameras, smartphones and car blackboxes are equipped with GPS sensors and able to capture videos with spatiotemporal information such as time, location and camera direction. We call such videos georeferenced videos. This paper presents a georeferenced video retrieval system, geosearch, which efficiently retrieves videos containing a certain point or range in the map. To enable a fast search of georeferenced videos, geosearch adopts a novel data structure MBTR (Minimum Bounding Tilted Rectangle) in the leaf nodes of R-Tree. New algorithms are developed to build MBTRs from georeferenced videos and to efficiently process point and range queries on MBTRs. We demonstrate our system on real georeferenced videos, and show that, compared to previous methods, geosearch substantially reduces the index size and also improves the search speed for georeferenced video data. Our online demo is available at "http://dm.hwanjoyu.org/geosearch".

【Keywords】: georeferencing; spatial indexing; video search

199. Siren: an interactive tool for mining and visualizing geospatial redescriptions.

Paper Link】 【Pages】:1544-1547

【Authors】: Esther Galbrun ; Pauli Miettinen

【Abstract】: We present SIREN, an interactive tool for mining and visualizing geospatial redescriptions. Redescription mining is a powerful data analysis tool that aims at finding alternative descriptions of the same entities. For example, in biology, an important task is to identify the bioclimatic constraints that allow some species to survive, that is, to describe geographical regions in terms of both the fauna that inhabits them and their bioclimatic conditions. Using SIREN, users can explore geospatial data of their interest by visualizing the redescriptions on a map, interactively edit, extend and filter them. To demonstrate the use of the tool, we focus on climatic niche-finding over Europe, as an example task. Yet, SIREN is by no means limited to a particular dataset or application.

【Keywords】: interactive data mining; redescription mining

Paper Link】 【Pages】:1548-1551

【Authors】: Shamanth Kumar ; Fred Morstatter ; Grant Marshall ; Huan Liu ; Ullas Nambiar

【Abstract】: Recent years have seen an exponential increase in the number of users of social media sites. As the number of users of these sites continues to grow at an extraordinary rate, the amount of data produced follows in magnitude. With this deluge of social media data, the need for comprehensive tools to analyze user interactions is ever increasing. In this paper, we present a novel tool, Navigating Information Facets on Twitter (NIF-T), which helps users to explore data generated on social media sites. Using the three dimensions or facets: time, location, and topic as an example of the many possible facets, we enable the users to explore large social media datasets. With the help of a large corpus of tweets collected from the Occupy Wall Street movement on the Twitter platform we show how our system can be used to identify important aspects of the event along these facets.

【Keywords】: and visual analytics; faceted search; social media

201. HeteRecom: a semantic-based recommendation systemin heterogeneous networks.

Paper Link】 【Pages】:1552-1555

【Authors】: Chuan Shi ; Chong Zhou ; Xiangnan Kong ; Philip S. Yu ; Gang Liu ; Bai Wang

【Abstract】: Making accurate recommendations for users has become an important function of e-commerce system with the rapid growth of WWW. Conventional recommendation systems usually recommend similar objects, which are of the same type with the query object without exploring the semantics of different similarity measures. In this paper, we organize objects in the recommendation system as a heterogeneous network. Through employing a path-based relevance measure to evaluate the relatedness between any-typed objects and capture the subtle semantic containing in each path, we implement a prototype system (called HeteRecom) for semantic based recommendation. HeteRecom has the following unique properties: (1) It provides the semantic-based recommendation function according to the path specified by users. (2) It recommends the similar objects of the same type as well as related objects of different types. We demonstrate the effectiveness of our system with a real-world movie data set.

【Keywords】: heterogeneous information network; recommendation; semantic search; similarity

202. VOXSUP: a social engagement framework.

Paper Link】 【Pages】:1556-1559

【Authors】: Yusheng Xie ; Daniel Honbo ; Alok N. Choudhary ; Kunpeng Zhang ; Yu Cheng ; Ankit Agrawal

【Abstract】: Social media websites are currently central hubs on the Internet. Major online social media platforms are not only places for individual users to socialize but are increasingly more important as channels for companies to advertise, public figures to engage, etc. In order to optimize such advertising and engaging efforts, there is an emerging challenge for knowledge discovery on today's Internet. The goal of knowledge discovery is to understand the entire online social landscape instead of merely summarizing the statistics. To answer this challenge, we have created VOXSUP as a unified social engagement framework. Unlike most existing tools, VOXSUP not only aggregates and filters social data from the Internet, but also provides what we call Voxsupian Knowledge Discovery (VKD). VKD consists of an almost human-level understanding of social conversations at any level of granularity from a single comment sentiment to multi-lingual inter-platform user demographics. Here we describe the technologies that are crucial to VKD, and subsequently go beyond experimental verification and present case studies from our live VOXSUP system.

【Keywords】: opinion mining; social ranking; topic model

203. A system for extracting top-K lists from the web.

Paper Link】 【Pages】:1560-1563

【Authors】: Zhixian Zhang ; Kenny Qili Zhu ; Haixun Wang

【Abstract】: List data is an important source of structured data on the web. This paper is concerned with "top-k" pages, which are web pages that describe a list of k instances of a particular topic or concept. Examples include "the 10 tallest persons in the world" and "the 50 hits of 2010 you don't want to miss". Compared to normal web list data, "top-k" lists contain richer information and are easier to understand. Therefore the extraction of such lists can help enrich existing knowledge bases about general concepts, or act as a preprocessing step to produce facts for a fact answering engine. We present an efficient system that extracts the target lists from web pages with high accuracy. We have used the system to process up to 160 million, or 1/10 of a high-frequency web snapshot from Bing, and obtained over 140,000 lists with 90.4% precision.

【Keywords】: list extraction; top-k lists; web information extraction; web mining

204. EventSearch: a system for event discovery and retrieval on multi-type historical data.

Paper Link】 【Pages】:1564-1567

【Authors】: Dongdong Shan ; Wayne Xin Zhao ; Rishan Chen ; Baihan Shu ; Ziqi Wang ; Junjie Yao ; Hongfei Yan ; Xiaoming Li

【Abstract】: We present EventSearch, a system for event extraction and retrieval on four types of news-related historical data, i.e., Web news articles, newspapers, TV news program, and micro-blog short messages. The system incorporates over 11 million web pages extracted from "Web InfoMall", the Chinese Web Archive since 2001. The newspaper and TV news video clips also span from 2001 to 2011. The system, upon a user query, returns a list of event snippets from multiple data sources. A novel burst model is used to discover events from time-stamped texts. In addition to offline event extraction, our system also provides online event extraction to further meet the user needs. EventSearch provides meaningful analytics that synthesize an accurate description of events. Users interact with the system by ranking the identified events using different criteria (scale, recency and relevance) and submitting their own information needs in different input fields.

【Keywords】: event detection; event search

205. EvaPlanner: an evacuation planner with social-based flocking kinetics.

Paper Link】 【Pages】:1568-1571

【Authors】: Cheng-Te Li ; Shou-De Lin

【Abstract】: This paper demonstrates a system that exploits graph mining, social network analysis, and agent-based crowd simulation techniques to investigate the evacuation dynamics during fire emergency. We create a novel evacuation planning system, EvaPlanner, to deal with three tasks. First, the system identifies the preferable locations to establish the exits to facilitate efficient evacuation from the dangerous areas. Second, it determines the most effective positions to place the emergency signs such that panic crowd can quickly find the exits. Third, it faithfully simulates the evacuation dynamics of crowd considering not only the individual movement kinetics but also the social connections between people. EvaPlanner provides a flexible experimental platform for investigating the evacuation dynamics under a variety of settings, and can further be utilized for animation and movie production. In addition, it can serve as a tool to assist architects address the safety concern during the planning phase. The demo system can be found in the link: http://mslab.csie.ntu.edu.tw/evaplanner/

【Keywords】: crowd simulation; evacuation planning; social network

Paper Link】 【Pages】:1572-1575

【Authors】: Jinoh Oh ; Taehoon Kim ; Sun Park ; Hwanjo Yu

【Abstract】: Exploring PubMed to find relevant information is challenging and time-consuming because PubMed typically returns a long list of articles as a result of query. Semantic network helps users to explore a large document collection and to capture key concepts and relationships among the concepts. The semantic network also serves to broaden the user's knowledge and extend query keyword by detecting and visualizing new related concepts or relations hidden in the retrieved documents. The problem of existing semantic network techniques is that they typically produce many redundant relationships, which prevents users from quickly capturing the underlying relationships among concepts. This paper develops an online PubMed search system, which displays semantic networks having no redundant relationships in real-time as a result of query. To do so, we propose an efficient semantic network construction algorithm, which prevents producing redundant relationships during the network construction. Our extensive experiments on actual PubMed data show that the proposed method is significantly faster than the method removing redundant relationships afterward. Our method is implemented and integrated into a relevance feedback PubMed search engine, called RefMed, "http://dm.postech.ac.kr/refmed", and will be demonstrated through the website.

【Keywords】: PubMed; information retrieval; semantic network construction