17. KDD 2011:San Diego, CA, USA

Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011. ACM 【DBLP Link】

Paper Num: 177 || Session Num: 30

Classification 4
Demonstration track 11
Deployed 1 4
Deployed 2 4
Discovery 4
Emerging 1 5
Emerging 2 5
Emerging 3 4
Frequent sets 3
Graph analysis 4
Graph mining 3
Industrial practice expo track 10
Keynote address 1 1
Keynote address 2 1
Keynote address 3 1
Keynote address 4 1
Matrix factorization 4
Online data and streams 4
Poster session 70
Predictive modeling 3
Privacy 3
Scalability 3
Social networks 3
Text mining 1 3
Text mining 2 3
Text mining 3 3
Theory 3
Unsupervised learning 3
User modeling 3
Web user modeling 4

Keynote address 1 1

1. Convex optimization: from embedded real-time to large-scale distributed.

【Paper Link】【Pages】:1

【Authors】: Stephen Boyd

【Abstract】: Convex optimization has emerged as useful tool for applications that include data analysis and model fitting, resource allocation, engineering design, network design and optimization, finance, and control and signal processing. After an overview, the talk will focus on two extremes: real-time embedded convex optimization, and distributed convex optimization. Code generation can be used to generate extremely efficient and reliable solvers for small problems, that can execute in milliseconds or microseconds, and are ideal for embedding in real-time systems. At the other extreme, we describe methods for large-scale distributed optimization, which coordinate many solvers to solve enormous problems.

【Keywords】: convex optimization

Keynote address 2 1

2. Internet scale data analysis.

【Paper Link】【Pages】:2

【Authors】: Peter Norvig

【Abstract】: This talk covers techniques for analyzing data sets with up to trillions of examples with billions of features, using thousands of computers. To operate at this scale requires an understanding of an increasing complex hardware hierarchy (e.g. cache, memory, SSD, another machine in the rack, disk, a machine in another data center, ...); a model for recovering from inevitable hardware and software failures; a machine learning model that allows for efficient computation over large, continuously updated data sets; and a way to visualize and share the results.

【Keywords】: large scale data analysis

Keynote address 3 1

3. Cancer genomics.

【Paper Link】【Pages】:3-4

【Authors】: David Haussler

【Abstract】: Throughout life, the cells in every individual accumulate many changes in the DNA inherited from his or her parents. Certain combinations of changes lead to cancer. During the last decade, the cost of DNA sequencing has been dropping by a factor of 10 every two years, making it now possible to read most of the three billion base genome from a patient's cancer tumor, and to try to determine all of the thousands of DNA changes in it. Under the auspices of NCI's Cancer Genome Atlas Project, 10,000 tumors will be sequenced in this manner in the next few years. Soon cancer genome sequencing will be a widespread clinical practice, and millions of tumors will be sequenced. A massive computational problem looms in interpreting these data. First, because we can only read short pieces of DNA, we have the enormous problem of assembling a coherent and reliable representation of the tumor genome from massive amounts of incomplete and error-prone evidence. This is the first challenge. Second, every human genome is unique from birth, and every tumor a unique variant. There is no single route to cancer. We must learn to read the varied signatures of cancer within the tumor genome and associate these with optimal treatments. Already there are hundreds of molecularly targeted treatments for cancer available, each known to be more or less effective depending on specific genetic variants. However, targeting a single gene with one treatment rarely works. The second challenge is to tackle the combinatorics of personalized, targeted, combination therapy in cancer.

【Keywords】: cancer

Keynote address 4 1

4. The mathematics of causal inference.

【Paper Link】【Pages】:5

【Authors】: Judea Pearl

【Abstract】: I will review concepts, principles, and mathematical tools that were found useful in applications involving causal and counterfactual relationships. This semantical framework, enriched with a few ideas from logic and graph theory, gives rise to a complete, coherent, and friendly calculus of causation that unifies the graphical and counterfactual approaches to causation and resolves many long-standing problems in several of the sciences. These include questions of causal effect estimation, policy analysis, and the integration of data from diverse studies. Of special interest to KDD researchers would be the following topics: The Mediation Formula, and what it tells us about direct and indirect effects. What mathematics can tell us about "external validity" or "generalizing from experiments" What can graph theory tell us about recovering from sample-selection bias.

【Keywords】: causal inference

Classification 4

5. CHIRP: a new classifier based on composite hypercubes on iterated random projections.

【Paper Link】【Pages】:6-14

【Authors】: Leland Wilkinson ; Anushka Anand ; Dang Tuan Nhon

【Abstract】: We introduce a classifier based on the L-infinity norm. This classifier, called CHIRP, is an iterative sequence of three stages (projecting, binning, and covering) that are designed to deal with the curse of dimensionality, computational complexity, and nonlinear separability. CHIRP is not a hybrid or modification of existing classifiers; it employs a new covering algorithm. The accuracy of CHIRP on widely-used benchmark datasets exceeds the accuracy of competitors. Its computational complexity is sub-linear in number of instances and number of variables and subquadratic in number of classes.

【Keywords】: random projections; supervised classification

6. Supervised learning for provenance-similarity of binaries.

【Paper Link】【Pages】:15-23

【Authors】: Sagar Chaki ; Cory Cohen ; Arie Gurfinkel

【Abstract】: Understanding, measuring, and leveraging the similarity of binaries (executable code) is a foundational challenge in software engineering. We present a notion of similarity based on provenance -- two binaries are similar if they are compiled from the same (or very similar) source code with the same (or similar) compilers. Empirical evidence suggests that provenance-similarity accounts for a significant portion of variation in existing binaries, particularly in malware. We propose and evaluate the applicability of classification to detect provenance-similarity. We evaluate a variety of classifiers, and different types of attributes and similarity labeling schemes, on two benchmarks derived from open-source software and malware respectively. We present encouraging results indicating that classification is a viable approach for automated provenance-similarity detection, and as an aid for malware analysts in particular.

【Keywords】: binary similarity; classification; software provenance

7. Trading representability for scalability: adaptive multi-hyperplane machine for nonlinear classification.

【Paper Link】【Pages】:24-32

【Authors】: Zhuang Wang ; Nemanja Djuric ; Koby Crammer ; Slobodan Vucetic

【Abstract】: Support Vector Machines (SVMs) are among the most popular and successful classification algorithms. Kernel SVMs often reach state-of-the-art accuracies, but suffer from the curse of kernelization due to linear model growth with data size on noisy data. Linear SVMs have the ability to efficiently learn from truly large data, but they are applicable to a limited number of domains due to low representational power. To fill the representability and scalability gap between linear and nonlinear SVMs, we propose the Adaptive Multi-hyperplane Machine (AMM) algorithm that accomplishes fast training and prediction and has capability to solve nonlinear classification problems. AMM model consists of a set of hyperplanes (weights), each assigned to one of the multiple classes, and predicts based on the associated class of the weight that provides the largest prediction. The number of weights is automatically determined through an iterative algorithm based on the stochastic gradient descent algorithm which is guaranteed to converge to a local optimum. Since the generalization bound decreases with the number of weights, a weight pruning mechanism is proposed and analyzed. The experiments on several large data sets show that AMM is nearly as fast during training and prediction as the state-of-the-art linear SVM solver and that it can be orders of magnitude faster than kernel SVM. In accuracy, AMM is somewhere between linear and kernel SVMs. For example, on an OCR task with 8 million highly dimensional training examples, AMM trained in 300 seconds on a single-core processor had 0.54% error rate, which was significantly lower than 2.03% error rate of a linear SVM trained in the same time and comparable to 0.43% error rate of a kernel SVM trained in 2 days on 512 processors. The results indicate that AMM could be an attractive option when solving large-scale classification problems. The software is available at www.dabi.temple.edu/~vucetic/AMM.html.

【Keywords】: large-scale learning; nonlinear classification; stochastic gradient descent; support vector machines

8. An improved GLMNET for l1-regularized logistic regression.

【Paper Link】【Pages】:33-41

【Authors】: Guo-Xun Yuan ; Chia-Hua Ho ; Chih-Jen Lin

【Abstract】: GLMNET proposed by Friedman et al. is an algorithm for generalized linear models with elastic net. It has been widely applied to solve L1-regularized logistic regression. However, recent experiments indicated that the existing GLMNET implementation may not be stable for large-scale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient regardless of loosely or strictly solving the optimization problem. Experiments demonstrate that the improved GLMNET is more efficient than a state-of-the-art coordinate descent method.

【Keywords】: l1 regularization; linear classification; logistic regression

Matrix factorization 4

9. Integrating low-rank and group-sparse structures for robust multi-task learning.

【Paper Link】【Pages】:42-50

【Authors】: Jianhui Chen ; Jiayu Zhou ; Jieping Ye

【Abstract】: Multi-task learning (MTL) aims at improving the generalization performance by utilizing the intrinsic relationships among multiple related tasks. A key assumption in most MTL algorithms is that all tasks are related, which, however, may not be the case in many real-world applications. In this paper, we propose a robust multi-task learning (RMTL) algorithm which learns multiple tasks simultaneously as well as identifies the irrelevant (outlier) tasks. Specifically, the proposed RMTL algorithm captures the task relationships using a low-rank structure, and simultaneously identifies the outlier tasks using a group-sparse structure. The proposed RMTL algorithm is formulated as a non-smooth convex (unconstrained) optimization problem. We propose to adopt the accelerated proximal method (APM) for solving such an optimization problem. The key component in APM is the computation of the proximal operator, which can be shown to admit an analytic solution. We also theoretically analyze the effectiveness of the RMTL algorithm. In particular, we derive a key property of the optimal solution to RMTL; moreover, based on this key property, we establish a theoretical bound for characterizing the learning performance of RMTL. Our experimental results on benchmark data sets demonstrate the effectiveness and efficiency of the proposed algorithm.

【Keywords】: group-sparsity; low-rank patterns; multi-task learning; robust

10. Model order selection for boolean matrix factorization.

【Paper Link】【Pages】:51-59

【Authors】: Pauli Miettinen ; Jilles Vreeken

【Abstract】: Matrix factorizations---where a given data matrix is approximated by a product of two or more factor matrices---are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the `model order selection problem' of determining where fine-grained structure stops, and noise starts, i.e., what is the proper size of the factor matrices. Boolean matrix factorization (BMF)---where data, factors, and matrix product are Boolean---has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. But so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general---making it applicable for any BMF algorithm. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.

【Keywords】: boolean matrix factorizations; matrix decompositions; matrix factorizations; minimum description length principle; model order selection; model selection

11. Rank aggregation via nuclear norm minimization.

【Paper Link】【Pages】:60-68

【Authors】: David F. Gleich ; Lek-Heng Lim

【Abstract】: The process of rank aggregation is intimately intertwined with the structure of skew symmetric matrices. We apply recent advances in the theory and algorithms of matrix completion to skew-symmetric matrices. This combination of ideas produces a new method for ranking a set of items. The essence of our idea is that a rank aggregation describes a partially filled skew-symmetric matrix. We extend an algorithm for matrix completion to handle skew-symmetric data and use that to extract ranks for each item. Our algorithm applies to both pairwise comparison and rating data. Because it is based on matrix completion, it is robust to both noise and incomplete data. We show a formal recovery result for the noiseless case and present a detailed study of the algorithm on synthetic data and Netflix ratings.

【Keywords】: nuclear norm; rank aggregation; skew symmetric

12. Large-scale matrix factorization with distributed stochastic gradient descent.

【Paper Link】【Pages】:69-77

【Authors】: Rainer Gemulla ; Erik Nijkamp ; Peter J. Haas ; Yannis Sismanis

【Abstract】: We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. We first develop a novel "stratified" SGD variant (SSGD) that applies to general loss-minimization problems in which the loss function can be expressed as a weighted sum of "stratum losses." We establish sufficient conditions for convergence of SSGD using results from stochastic approximation theory and regenerative process theory. We then specialize SSGD to obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. DSGD can handle a wide variety of matrix factorizations. We describe the practical techniques used to optimize performance in our DSGD implementation. Experiments suggest that DSGD converges significantly faster and has better scalability properties than alternative algorithms.

【Keywords】: distributed matrix factorization; mapreduce; recommendation system; stochastic gradient descent

Graph analysis 4

13. Diversity in ranking via resistive graph centers.

【Paper Link】【Pages】:78-86

【Authors】: Avinava Dubey ; Soumen Chakrabarti ; Chiranjib Bhattacharyya

【Abstract】: Users can rarely reveal their information need in full detail to a search engine within 1--2 words, so search engines need to "hedge their bets" and present diverse results within the precious 10 response slots. Diversity in ranking is of much recent interest. Most existing solutions estimate the marginal utility of an item given a set of items already in the response, and then use variants of greedy set cover. Others design graphs with the items as nodes and choose diverse items based on visit rates (PageRank). Here we introduce a radically new and natural formulation of diversity as finding centers in resistive graphs. Unlike in PageRank, we do not specify the edge resistances (equivalently, conductances) and ask for node visit rates. Instead, we look for a sparse set of center nodes so that the effective conductance from the center to the rest of the graph has maximum entropy. We give a cogent semantic justification for turning PageRank thus on its head. In marked deviation from prior work, our edge resistances are learnt from training data. Inference and learning are NP-hard, but we give practical solutions. In extensive experiments with subtopic retrieval, social network search, and document summarization, our approach convincingly surpasses recently-published diversity algorithms like subtopic cover, max-marginal relevance (MMR), Grasshopper, DivRank, and SVMdiv.

【Keywords】: conductance; diversity; graph; ranking

14. Collective graph identification.

【Paper Link】【Pages】:87-95

【Authors】: Galileo Namata ; Stanley Kok ; Lise Getoor

【Abstract】: Data describing networks (communication networks, transaction networks, disease transmission networks, collaboration networks, etc.) is becoming increasingly ubiquitous. While this observational data is useful, it often only hints at the actual underlying social or technological structures which give rise to the interactions. For example, an email communication network provides useful insight but is not the same as the "real" social network among individuals. In this paper, we introduce the problem of graph identification, i.e., the discovery of the true graph structure underlying an observed network. We cast the problem as a probabilistic inference task, in which we must infer the nodes, edges, and node labels of a hidden graph, based on evidence provided by the observed network. This in turn corresponds to the problems of performing entity resolution, link prediction, and node labeling to infer the hidden graph. While each of these problems have been studied separately, they have never been considered together as a coherent task. We present a simple yet novel approach to address all three problems simultaneously. Our approach, called C3, consists of Coupled Collective Classifiers that are iteratively applied to propagate information among solutions to the problems. We empirically demonstrate that C3 is superior, in terms of both predictive accuracy and runtime, to state-of-the-art probabilistic approaches on three real-world problems.

【Keywords】: collective classification; entity resolution; link prediction; semi-supervised learning

15. Semi-supervised ranking on very large graphs with rich metadata.

【Paper Link】【Pages】:96-104

【Authors】: Bin Gao ; Tie-Yan Liu ; Wei Wei ; Taifeng Wang ; Hang Li

【Abstract】: Graph ranking plays an important role in many applications, such as page ranking on web graphs and entity ranking on social networks. In applications, besides graph structure, rich information on nodes and edges and explicit or implicit human supervision are often available. In contrast, conventional algorithms (e.g., PageRank and HITS) compute ranking scores by only resorting to graph structure information. A natural question arises here, that is, how to effectively and efficiently leverage all the information to more accurately calculate graph ranking scores than the conventional algorithms, assuming that the graph is also very large. Previous work only partially tackled the problem, and the proposed solutions are also not satisfying. This paper addresses the problem and proposes a general framework as well as an efficient algorithm for graph ranking. Specifically, we define a semi-supervised learning framework for ranking of nodes on a very large graph and derive within our proposed framework an efficient algorithm called Semi-Supervised PageRank. In the algorithm, the objective function is defined based upon a Markov random walk on the graph. The transition probability and the reset probability of the Markov model are defined as parametric models based on features on nodes and edges. By minimizing the objective function, subject to a number of constraints derived from supervision information, we simultaneously learn the optimal parameters of the model and the optimal ranking scores of the nodes. Finally, we show that it is possible to make the algorithm efficient to handle a billion-node graph by taking advantage of the sparsity of the graph and implement it in the MapReduce logic. Experiments on real data from a commercial search engine show that the proposed algorithm can outperform previous algorithms on several tasks.

【Keywords】: mapreduce; page importance; pagerank

16. Benefits of bias: towards better characterization of network sampling.

【Paper Link】【Pages】:105-113

【Authors】: Arun S. Maiya ; Tanya Y. Berger-Wolf

【Abstract】: From social networks to P2P systems, network sampling arises in many settings. We present a detailed study on the nature of biases in network sampling strategies to shed light on how best to sample from networks. We investigate connections between specific biases and various measures of structural representativeness. We show that certain biases are, in fact, beneficial for many applications, as they "push" the sampling process towards inclusion of desired properties. Finally, we describe how these sampling biases can be exploited in several, real-world applications including disease outbreak detection and market research.

【Keywords】: bias; complex networks; crawling; graph mining; link mining; online sampling; sampling; social network analysis

Web user modeling 4

17. Scalable distributed inference of dynamic user interests for behavioral targeting.

【Paper Link】【Pages】:114-122

【Authors】: Amr Ahmed ; Yucheng Low ; Mohamed Aly ; Vanja Josifovski ; Alexander J. Smola

【Abstract】: Historical user activity is key for building user profiles to predict the user behavior and affinities in many web applications such as targeting of online advertising, content personalization and social recommendations. User profiles are temporal, and changes in a user's activity patterns are particularly useful for improved prediction and recommendation. For instance, an increased interest in car-related web pages may well suggest that the user might be shopping for a new vehicle.In this paper we present a comprehensive statistical framework for user profiling based on topic models which is able to capture such effects in a fully \emph{unsupervised} fashion. Our method models topical interests of a user dynamically where both the user association with the topics and the topics themselves are allowed to vary over time, thus ensuring that the profiles remain current. We describe a streaming, distributed inference algorithm which is able to handle tens of millions of users. Our results show that our model contributes towards improved behavioral targeting of display advertising relative to baseline models that do not incorporate topical and/or temporal dependencies. As a side-effect our model yields human-understandable results which can be used in an intuitive fashion by advertisers.

【Keywords】: computational advertising; distributed inference; large-scale; online inference; user modeling

18. Multiple domain user personalization.

【Paper Link】【Pages】:123-131

【Authors】: Yucheng Low ; Deepak Agarwal ; Alexander J. Smola

【Abstract】: Content personalization is a key tool in creating attractive websites. Synergies can be obtained by integrating personalization between several Internet properties. In this paper we propose a hierarchical Bayesian model to address these issues. Our model allows the integration of multiple properties without changing the overall structure, which makes it easily extensible across large Internet portals. It relies at its lowest level on Latent Dirichlet Allocation, while making use of latent side features for cross-property integration. We demonstrate the efficiency of our approach by analyzing data from several properties of a major Internet portal.

【Keywords】: domain integration; latent dirichlet allocation; parallel statistical inference; user profiling

19. Click shaping to optimize multiple objectives.

【Paper Link】【Pages】:132-140

【Authors】: Deepak Agarwal ; Bee-Chung Chen ; Pradheep Elango ; Xuanhui Wang

【Abstract】: Recommending interesting content to engage users is important for web portals (e.g. AOL, MSN, Yahoo!, and many others). Existing approaches typically recommend articles to optimize for a single objective, i.e., number of clicks. However a click is only the starting point of a user's journey and subsequent downstream utilities such as time-spent and revenue are important. In this paper, we call the problem of recommending links to jointly optimize for clicks and post-click downstream utilities click shaping. We propose a multi-objective programming approach in which multiple objectives are modeled in a constrained optimization framework. Such a formulation can naturally incorporate various application-driven requirements. We study several variants that model different requirements as constraints and discuss some of the subtleties involved. We conduct our experiments on a large dataset from a real system by using a newly proposed unbiased evaluation methodology [17]. Through extensive experiments we quantify the tradeoff between different objectives under various constraints. Our experimental results show interesting characteristics of different formulations and our findings may provide valuable guidance to the design of recommendation engines for web portals.

【Keywords】: click shaping; constrained optimization; multi-objective

20. Response prediction using collaborative filtering with hierarchies and side-information.

【Paper Link】【Pages】:141-149

【Authors】: Aditya Krishna Menon ; Krishna Prasad Chitrapura ; Sachin Garg ; Deepak Agarwal ; Nagaraj Kota

【Abstract】: In online advertising, response prediction is the problem of estimating the probability that an advertisement is clicked when displayed on a content publisher's webpage. In this paper, we show how response prediction can be viewed as a problem of matrix completion, and propose to solve it using matrix factorization techniques from collaborative filtering (CF). We point out the two crucial differences between standard CF problems and response prediction, namely the requirement of predicting probabilities rather than scores, and the issue of confidence in matrix entries. We address these issues using a matrix factorization analogue of logistic regression, and by applying a principled confidence-weighting scheme to its objective. We show how this factorization can be seamlessly combined with explicit features or side-information for pages and ads, which let us combine the benefits of both approaches. Finally, we combat the extreme sparsity of response prediction data by incorporating hierarchical information about the pages and ads into our factorization model. Experiments on three very large real-world datasets show that our model outperforms current state-of-the-art methods for response prediction.

【Keywords】: collaborative filtering; hierarchies; response prediction

User modeling 3

21. From bias to opinion: a transfer-learning approach to real-time sentiment analysis.

【Paper Link】【Pages】:150-158

【Authors】: Pedro Henrique Calais Guerra ; Adriano Veloso ; Wagner Meira Jr. ; Virgilio Almeida

【Abstract】: Real-time interaction, which enables live discussions, has become a key feature of most Web applications. In such an environment, the ability to automatically analyze user opinions and sentiments as discussions develop is a powerful resource known as real time sentiment analysis. However, this task comes with several challenges, including the need to deal with highly dynamic textual content that is characterized by changes in vocabulary and its subjective meaning and the lack of labeled data needed to support supervised classifiers. In this paper, we propose a transfer learning strategy to perform real time sentiment analysis. We identify a task - opinion holder bias prediction - which is strongly related to the sentiment analysis task; however, in constrast to sentiment analysis, it builds accurate models since the underlying relational data follows a stationary distribution. Instead of learning textual models to predict content polarity (i.e., the traditional sentiment analysis approach), we first measure the bias of social media users toward a topic, by solving a relational learning task over a network of users connected by endorsements (e.g., retweets in Twitter). We then analyze sentiments by transferring user biases to textual features. This approach works because while new terms may arise and old terms may change their meaning, user bias tends to be more consistent over time as a basic property of human behavior. Thus, we adopted user bias as the basis for building accurate classification models. We applied our model to posts collected from Twitter on two topics: the 2010 Brazilian Presidential Elections and the 2010 season of Brazilian Soccer League. Our results show that knowing the bias of only 10% of users generates an F1 accuracy level ranging from 80% to 90% in predicting user sentiment in tweets.

【Keywords】: opinion mining; relational learning; sentiment analysis; social media; transfer learning

22. User reputation in a comment rating environment.

【Paper Link】【Pages】:159-167

【Authors】: Bee-Chung Chen ; Jian Guo ; Belle L. Tseng ; Jie Yang

【Abstract】: Reputable users are valuable assets of a web site. We focus on user reputation in a comment rating environment, where users make comments about content items and rate the comments of one another. Intuitively, a reputable user posts high quality comments and is highly rated by the user community. To our surprise, we find that the quality of a comment judged editorially is almost uncorrelated with the ratings that it receives, but can be predicted using standard text features, achieving accuracy as high as the agreement between two editors! However, extracting a pure reputation signal from ratings is difficult because of data sparseness and several confounding factors in users' voting behavior. To address these issues, we propose a novel bias-smoothed tensor model and empirically show that our model significantly outperforms a number of alternatives based on Yahoo! News, Yahoo! Buzz and Epinions datasets.

【Keywords】: bias removal; hierarchical smoothing; latent factor model; tensor factorization; text quality

23. Selecting a comprehensive set of reviews.

【Paper Link】【Pages】:168-176

【Authors】: Panayiotis Tsaparas ; Alexandros Ntoulas ; Evimaria Terzi

【Abstract】: Online user reviews play a central role in the decision-making process of users for a variety of tasks, ranging from entertainment and shopping to medical services. As user-generated reviews proliferate, it becomes critical to have a mechanism for helping the users (information consumers) deal with the information overload, and presenting them with a small comprehensive set of reviews that satisfies their information need. This is particularly important for mobile phone users, who need to make decisions quickly, and have a device with limited screen real-estate for displaying the reviews. Previous approaches have addressed the problem by ranking reviews according to their (estimated) helpfulness. However, such approaches do not account for the fact that the top few high-quality reviews may be highly redundant, repeating the same information, or presenting the same positive (or negative) perspective. In this work, we focus on the problem of selecting a comprehensive set of few high-quality reviews that cover many different aspects of the reviewed item. We formulate the problem as a maximum coverage problem, and we present a generic formalism that can model the different variants of review-set selection. We describe algorithms for the different variants we consider, and, whenever possible, we provide approximation guarantees with respect to the optimal solution. We also perform an experimental evaluation on real data in order to understand the value of coverage for users.

【Keywords】: greedy algorithms; review selection; set cover

Online data and streams 4

24. Enabling fast prediction for ensemble models on data streams.

【Paper Link】【Pages】:177-185

【Authors】: Peng Zhang ; Jun Li ; Peng Wang ; Byron J. Gao ; Xingquan Zhu ; Li Guo

【Abstract】: Ensemble learning has become a common tool for data stream classification, being able to handle large volumes of stream data and concept drifting. Previous studies focus on building accurate prediction models from stream data. However, a linear scan of a large number of base classifiers in the ensemble during prediction incurs significant costs in response time, preventing ensemble learning from being practical for many real world time-critical data stream applications, such as Web traffic stream monitoring, spam detection, and intrusion detection. In these applications, data streams usually arrive at a speed of GB/second, and it is necessary to classify each stream record in a timely manner. To address this problem, we propose a novel Ensemble-tree (E-tree for short) indexing structure to organize all base classifiers in an ensemble for fast prediction. On one hand, E-trees treat ensembles as spatial databases and employ an R-tree like height-balanced structure to reduce the expected prediction time from linear to sub-linear complexity. On the other hand, E-trees can automatically update themselves by continuously integrating new classifiers and discarding outdated ones, well adapting to new trends and patterns underneath data streams. Experiments on both synthetic and real-world data streams demonstrate the performance of our approach.

【Keywords】: concept drifting; ensemble learning; spatial indexing; stream classification; stream data mining

25. Online active inference and learning.

【Paper Link】【Pages】:186-194

【Authors】: Josh Attenberg ; Foster J. Provost

【Abstract】: We present a generalized framework for active inference, the selective acquisition of labels for cases at prediction time in lieu of using the estimated labels of a predictive model. We develop techniques within this framework for classifying in an online setting, for example, for classifying the stream of web pages where online advertisements are being served. Stream applications present novel complications because (i) at the time of label acquisition, we don't know the set of instances that we will eventually see, (ii) instances repeat based on some unknown (and possibly skewed) distribution. We combine ideas from decision theory, cost-sensitive learning, and online density estimation. We also introduce a method for on-line estimation of the utility distribution, which allows us to manage the budget over the stream. The resulting model tells which instances to label so that by the end of each budget period, the budget is best spent (in expectation). The main results show that: (1) our proposed approach to active inference on streams can indeed reduce error costs substantially over alternative approaches, (2) more sophisticated online estimations achieve larger reductions in error. We next discuss simultaneously conducting active inference and active learning. We show that our expected-utility active inference strategy also selects good examples for learning. We close by pointing out that our utility-distribution estimation strategy can also be applied to convert pool-based active learning techniques into budget-sensitive online active learning techniques.

【Keywords】: active inference; active learning; machine learning; micro-outsourcing; on-line advertising

26. Unbiased online active learning in data streams.

【Paper Link】【Pages】:195-203

【Authors】: Wei Chu ; Martin Zinkevich ; Lihong Li ; Achint Thomas ; Belle L. Tseng

【Abstract】: Unlabeled samples can be intelligently selected for labeling to minimize classification error. In many real-world applications, a large number of unlabeled samples arrive in a streaming manner, making it impossible to maintain all the data in a candidate pool. In this work, we focus on binary classification problems and study selective labeling in data streams where a decision is required on each sample sequentially. We consider the unbiasedness property in the sampling process, and design optimal instrumental distributions to minimize the variance in the stochastic process. Meanwhile, Bayesian linear classifiers with weighted maximum likelihood are optimized online to estimate parameters. In empirical evaluation, we collect a data stream of user-generated comments on a commercial news portal in 30 consecutive days, and carry out offline evaluation to compare various sampling strategies, including unbiased active learning, biased variants, and random sampling. Experimental results verify the usefulness of online active learning, especially in the non-stationary situation with concept drift.

【Keywords】: active learning; adaptive importance sampling; bayesian online learning; data streaming; unbiasedness

27. Learning to trade off between exploration and exploitation in multiclass bandit prediction.

【Paper Link】【Pages】:204-212

【Authors】: Hamed Valizadegan ; Rong Jin ; Shijun Wang

【Abstract】: We study multi-class bandit prediction, an online learning problem where the learner only receives a partial feedback in each trial indicating whether the predicted class label is correct. The exploration vs. exploitation tradeoff strategy is a well-known technique for online learning with incomplete feedback (i.e., bandit setup). Banditron [8], a multi-class online learning algorithm for bandit setting, maximizes the run-time gain by balancing between exploration and exploitation with a fixed tradeoff parameter. The performance of Banditron can be quite sensitive to the choice of the tradeoff parameter and therefore effective algorithms to automatically tune this parameter is desirable. In this paper, we propose three learning strategies to automatically adjust the tradeoff parameter for Banditron. Our extensive empirical study with multiple real-world data sets verifies the efficacy of the proposed approach in learning the exploration vs. exploitation tradeoff parameter.

【Keywords】: bandit feedback; exploration vs. exploitation; multi-class classification; online learning

Deployed 1 4

28. Linear scale semantic mining algorithms in microsoft SQL server's semantic platform.

【Paper Link】【Pages】:213-221

【Authors】: Kunal Mukerjee ; Todd Porter ; Sorin Gherman

【Abstract】: This paper describes three linear scale, incremental, and fully automatic semantic mining algorithms that are at the foundation of the new Semantic Platform being released in the next version of SQL Server. The target workload is large (10 -- 100 million) Enterprise document corpuses. At these scales, anything short of linear scale and incremental is costly to deploy. These three algorithms give rise to three weighted physical indexes: Tag Index (top keywords in each document); Document Similarity Index (top closely related documents given any document); and Semantic Phrase Similarity Index (top semantically related phrases, given any phrase), which are then query-able through the SQL interface. The need for specifically creating these three indexes was motivated by observing typical stages of document research, and gap analysis, given current tools and technology at the Enterprise. We describe the mining algorithms and architecture, and also outline some compelling user experiences that are enabled by the indexes.

【Keywords】: document similarity; incremental; keyword extraction; linear scale; semantic mining; semantic platform

29. Combining file content and file relations for cloud based malware detection.

【Paper Link】【Pages】:222-230

【Authors】: Yanfang Ye ; Tao Li ; Shenghuo Zhu ; Weiwei Zhuang ; Egemen Tas ; Umesh Gupta ; Melih Abdulhayoglu

【Abstract】: Due to their damages to Internet security, malware (such as virus, worms, trojans, spyware, backdoors, and rootkits) detection has caught the attention not only of anti-malware industry but also of researchers for decades. Resting on the analysis of file contents extracted from the file samples, like Application Programming Interface (API) calls, instruction sequences, and binary strings, data mining methods such as Naive Bayes and Support Vector Machines have been used for malware detection. However, besides file contents, relations among file samples, such as a "Downloader" is always associated with many Trojans, can provide invaluable information about the properties of file samples. In this paper, we study how file relations can be used to improve malware detection results and develop a file verdict system (named "Valkyrie") building on a semi-parametric classifier model to combine file content and file relations together for malware detection. To the best of our knowledge, this is the first work of using both file content and file relations for malware detection. A comprehensive experimental study on a large collection of PE files obtained from the clients of anti-malware products of Comodo Security Solutions Incorporation is performed to compare various malware detection approaches. Promising experimental results demonstrate that the accuracy and efficiency of our Valkyrie system outperform other popular anti-malware software tools such as Kaspersky AntiVirus and McAfee VirusScan, as well as other alternative data mining based detection systems.

【Keywords】: cloud based malware detection; file content; file relation; semi-parametric model for learning from graph

30. High-precision phrase-based document classification on a modern scale.

【Paper Link】【Pages】:231-239

【Authors】: Ron Bekkerman ; Matan Gavish

【Abstract】: We present a document classification system that employs lazy learning from labeled phrases, and argue that the system can be highly effective whenever the following property holds: most of information on document labels is captured in phrases. We call this property near sufficiency. Our research contribution is twofold: (a) we quantify the near sufficiency property using the Information Bottleneck principle and show that it is easy to check on a given dataset; (b) we reveal that in all practical cases---from small-scale to very large-scale---manual labeling of phrases is feasible: the natural language constrains the number of common phrases composed of a vocabulary to grow linearly with the size of the vocabulary. Both these contributions provide firm foundation to applicability of the phrase-based classification (PBC) framework to a variety of large-scale tasks. We deployed the PBC system on the task of job title classification, as a part of LinkedIn's data standardization effort. The system significantly outperforms its predecessor both in terms of precision and coverage. It is currently being used in LinkedIn's ad targeting product, and more applications are being developed. We argue that PBC excels in high explainability of the classification results, as well as in low development and low maintenance costs. We benchmark PBC against existing high-precision document classification algorithms and conclude that it is most useful in multilabel classification.

【Keywords】: high-precision classification; large-scale classification; multilabel text classification

31. Activity analysis based on low sample rate smart meters.

【Paper Link】【Pages】:240-248

【Authors】: Feng Chen ; Jing Dai ; Bingsheng Wang ; Sambit Sahu ; Milind R. Naphade ; Chang-Tien Lu

【Abstract】: Activity analysis disaggregates utility consumption from smart meters into specific usage that associates with human activities. It can not only help residents better manage their consumption for sustainable lifestyle, but also allow utility managers to devise conservation programs. Existing research efforts on disaggregating consumption focus on analyzing consumption features with high sample rates (mainly between 1 Hz ~ 1MHz). However, many smart meter deployments support sample rates at most 1/900 Hz, which challenges activity analysis with occurrences of parallel activities, difficulty of aligning events, and lack of consumption features. We propose a novel statistical framework for disaggregation on coarse granular smart meter readings by modeling fixture characteristics, household behavior, and activity correlations. This framework has been implemented into two approaches for different application scenarios, and has been deployed to serve over 300 pilot households in Dubuque, IA. Interesting activity-level consumption patterns have been identified, and the evaluation on both real and synthetic datasets has shown high accuracy on discovering washer and shower.

【Keywords】: classification; disaggregation; gaussian mixture model; hidden markov model; low sample rate; smart meter

Deployed 2 4

32. Estimating the number of users behind ip addresses for combating abusive traffic.

【Paper Link】【Pages】:249-257

【Authors】: Ahmed Metwally ; Matt Paduano

【Abstract】: This paper addresses estimating the number of the users of a specific application behind IP addresses (IPs). This problem is central to combating abusive traffic, such as DDoS attacks, ad click fraud and email spam. We share our experience building a general framework at Google for estimating the number of users behind IPs, called hereinafter the sizes of the IPs. The primary goal of this framework is combating abusive traffic without violating the user privacy. The estimation techniques produce statistically sound estimates of sizes relying solely on passively mining aggregated application log data, without probing machines or deploying active content like Java applets. This paper also explores using the estimated sizes to detect and filter abusive traffic. The proposed framework was used to build and deploy an ad click fraud filter at Google. The first 50M clicks tagged by the filter had a significant recall of all tagged clicks, and their false positive rate was below 1.4%. For the sake of comparison, we simulated a naive IP-based filter that does not consider the sizes of the IPs. To reach a comparable recall, the naive filter's false positive rate was 37% due to aggressive tagging.

【Keywords】: abusive traffic filtering; advertisement click fraud; ip size estimation; real data experiments

33. Data-driven multi-touch attribution models.

【Paper Link】【Pages】:258-264

【Authors】: Xuhui Shao ; Lexin Li

【Abstract】: In digital advertising, attribution is the problem of assigning credit to one or more advertisements for driving the user to the desirable actions such as making a purchase. Rather than giving all the credit to the last ad a user sees, multi-touch attribution allows more than one ads to get the credit based on their corresponding contributions. Multi-touch attribution is one of the most important problems in digital advertising, especially when multiple media channels, such as search, display, social, mobile and video are involved. Due to the lack of statistical framework and a viable modeling approach, true data-driven methodology does not exist today in the industry. While predictive modeling has been thoroughly researched in recent years in the digital advertising domain, the attribution problem focuses more on accurate and stable interpretation of the influence of each user interaction to the final user decision rather than just user classification. Traditional classification models fail to achieve those goals. In this paper, we first propose a bivariate metric, one measures the variability of the estimate, and the other measures the accuracy of classifying the positive and negative users. We then develop a bagged logistic regression model, which we show achieves a comparable classification accuracy as a usual logistic regression, but a much more stable estimate of individual advertising channel contributions. We also propose an intuitive and simple probabilistic model to directly quantify the attribution of different advertising channels. We then apply both the bagged logistic model and the probabilistic model to a real-world data set from a multi-channel advertising campaign for a well-known consumer software and services brand. The two models produce consistent general conclusions and thus offer useful cross-validation. The results of our attribution models also shed several important insights that have been validated by the advertising team. We have implemented the probabilistic model in the production advertising platform of the first author's company, and plan to implement the bagged logistic regression in the next product release. We believe availability of such data-driven multi-touch attribution metric and models is a break-through in the digital advertising industry.

【Keywords】: bagged logistic regression; digital advertising; multi-touch attribution model

34. Bid landscape forecasting in online ad exchange marketplace.

【Paper Link】【Pages】:265-273

【Authors】: Ying Cui ; Ruofei Zhang ; Wei Li ; Jianchang Mao

【Abstract】: Display advertising has been a significant source of revenue for publishers and ad networks in online advertising ecosystem. One important business model in online display advertising is Ad Exchange marketplace, also called non-guaranteed delivery (NGD), in which advertisers buy targeted page views and audiences on a spot market through real-time auction. In this paper, we describe a bid landscape forecasting system in NGD marketplace for any advertiser campaign specified by a variety of targeting attributes. In the system, the impressions that satisfy the campaign targeting attributes are partitioned into multiple mutually exclusive samples. Each sample is one unique combination of quantified attribute values. We develop a divide-and-conquer approach that breaks down the campaign-level forecasting problem. First, utilizing a novel star-tree data structure, we forecast the bid for each sample using non-linear regression by gradient boosting decision trees. Then we employ a mixture-of-log-normal model to generate campaign-level bid distribution based on the sample-level forecasted distributions. The experiment results of a system developed with our approach show that it can accurately forecast the bid distributions for various campaigns running on the world's largest NGD advertising exchange system, outperforming two baseline methods in term of forecasting errors.

【Keywords】: ad exchange marketplace; bid landscape forecasting; online display advertising

35. Detecting adversarial advertisements in the wild.

【Paper Link】【Pages】:274-282

【Authors】: D. Sculley ; Matthew Eric Otey ; Michael Pohl ; Bridget Spitznagel ; John Hainsworth ; Yunkai Zhou

【Abstract】: In a large online advertising system, adversaries may attempt to profit from the creation of low quality or harmful advertisements. In this paper, we present a large scale data mining effort that detects and blocks such adversarial advertisements for the benefit and safety of our users. Because both false positives and false negatives have high cost, our deployed system uses a tiered strategy combining automated and semi-automated methods to ensure reliable classification. We also employ strategies to address the challenges of learning from highly skewed data at scale, allocating the effort of human experts, leveraging domain expert knowledge, and independently assessing the effectiveness of our system.

【Keywords】: adversarial learning; data mining; online advertisement

Discovery 4

36. Applying data mining techniques to address disaster information management challenges on mobile devices.

【Paper Link】【Pages】:283-291

【Authors】: Li Zheng ; Chao Shen ; Liang Tang ; Tao Li ; Steven Luis ; Shu-Ching Chen

【Abstract】: The improvement of Crisis Management and Disaster Recovery techniques are national priorities in the wake of man-made and nature inflicted calamities of the last decade. Our prior work has demonstrated that the efficiency of sharing and managing information plays an important role in business recovery efforts after disaster event. With the proliferation of smart phones and wireless tablets, professionals who have an operational responsibility in disaster situations are relying on such devices to maintain communication. Further, with the rise of social media, technology savvy consumers are also using these devices extensively for situational updates. In this paper, we address several critical tasks which can facilitate information sharing and collaboration between both private and public sector participants for major disaster recovery planning and management. We design and implement an All-Hazard Disaster Situation Browser (ADSB) system that runs on Apple's mobile operating system (iOS) and iPhone and iPad mobile devices. Our proposed techniques create a collaborative solution on a mobile platform using advanced data mining and information retrieval techniques for disaster preparedness and recovery that helps impacted communities better understand the current disaster situation and how the community is recovering. Specifically, hierarchical summarization techniques are used to generate brief reviews from a large collection of reports at different granularities; probabilistic models are proposed to dynamically generate query forms based on user's feedback; and recommendation techniques are adapted to help users identify potential contacts for report sharing and community organization. Furthermore, the developed techniques are designed to be all-hazard capable so that they can be used in earthquake, terrorism, or other unanticipated disaster situations.

【Keywords】: data mining; disaster information management; dynamic query form; hierarchical summarization; user recommendation

37. Enhancing investment decisions in P2P lending: an investor composition perspective.

【Paper Link】【Pages】:292-300

【Authors】: Chunyu Luo ; Hui Xiong ; Wenjun Zhou ; Yanhong Guo ; Guishi Deng

【Abstract】: P2P lending, as a novel economic lending model, has imposed new challenges about how to make effective investment decisions. Indeed, a key challenge along this line is how to align the right information with the right people. For a long time, people have made tremendous efforts in establishing credit records for the borrowers. However, information from investors is still under-explored for improving investment decisions in P2P lending. To that end, we propose a data driven investment decision-making framework, which exploits the investor composition of each investment for enhancing decisions making in P2P lending. Specifically, we first build investor profiles based on quantitative analysis of past performances, risk preferences, and investment experiences of investors. Then, based on investor profiles, we develop an investor composition analysis model, which can be used to select valuable investments and improve the investment decisions. To validate the proposed model, we perform extensive experiments on the real-world data from the world's largest P2P lending marketplace. Experimental results reveal that investor composition can help us evaluate the profit potential of an investment and the decision model based on investor composition can help investors make better investment decisions.

【Keywords】: investor composition; investor profile; p2p lending

38. From market baskets to mole rats: using data mining techniques to analyze RFID data describing laboratory animal behavior.

【Paper Link】【Pages】:301-306

【Authors】: Daniel P. McCloskey ; Michael E. Kress ; Susan P. Imberman ; Igor Kushnir ; Susan Briffa-Mirabella

【Abstract】: The use of new technologies, such as RFID sensors, provides scientists with novel ways of doing experimental research. As scientists become more technologically savvy and use these techniques, the traditional approaches to data analysis fail given the huge amounts of data produced by these methods. In this paper we describe an experiment in which colonies of naked mole rats were tagged with RFID transponders. RFID sensors were strategically placed in the mole rat caging system. The goal of this experiment was to document and analyze the interactions between animals. The huge amount of data produced by the sensors was not analyzable using the traditional methods employed by behavioral neuroscience researchers. Computational methods used by data miners, such as cluster analysis, association rule mining, and graphical models, were able to scale to the data and produce knowledge and insight that was previously unknown. This paper describes in detail the experimental setup and the computational methods used.

【Keywords】: association rule mining; cluster analysis; maximal frequent itemsets; radio frequency identification; visualization

39. A pattern discovery approach to retail fraud detection.

【Paper Link】【Pages】:307-315

【Authors】: Prasad Gabbur ; Sharath Pankanti ; Quanfu Fan ; Hoang Trinh

【Abstract】: A major source of revenue shrink in retail stores is the intentional or unintentional failure of proper checking out of items by the cashier. More recently, a few automated surveillance systems have been developed to monitor cashier lanes and detect non-compliant activities such as fake item checkouts or scans done with the intention of deriving monetary benefit. These systems use data from surveillance video cameras and transaction logs (TLog) recorded at the Point-of-Sale (POS). In this paper, we present a pattern discovery based approach to detect fraudulent events at the POS. Our approach is based on mining time-ordered text streams, representing retail transactions, formed from a combination of visually detected checkout related activities called primitives and barcodes from TLog data. Patterns representing single item checkouts, i.e. anchored around a single barcode, are discovered from these text streams using an efficient pattern discovery technique called Teiresias. The discovered patterns are used to build models for true and fake item scans by retaining or discarding the anchoring barcodes in those patterns respectively. A pattern matching and classification scheme is designed to robustly detect non-compliant cashier activities in the presence of noise in either the TLog or the video data. Different weighting schemes for quantifying the relative importance of the discovered patterns are explored: Frequency, Support Vector Machine (SVM) and Frequency+SVM. Using a large scale dataset recorded from retail stores, our approach discovers semantically meaningful cashier scan patterns. Our experiments also suggest that different weighting schemes result in varied false and true positive performances on the task of fake scan detection.

【Keywords】: pattern discovery; retail; security; sweethearting; teiresias

Emerging 1 5

40. Driving with knowledge from the physical world.

【Paper Link】【Pages】:316-324

【Authors】: Jing Yuan ; Yu Zheng ; Xing Xie ; Guangzhong Sun

【Abstract】: This paper presents a Cloud-based system computing customized and practically fast driving routes for an end user using (historical and real-time) traffic conditions and driver behavior. In this system, GPS-equipped taxicabs are employed as mobile sensors constantly probing the traffic rhythm of a city and taxi drivers' intelligence in choosing driving directions in the physical world. Meanwhile, a Cloud aggregates and mines the information from these taxis and other sources from the Internet, like Web maps and weather forecast. The Cloud builds a model incorporating day of the week, time of day, weather conditions, and individual driving strategies (both of the taxi drivers and of the end user for whom the route is being computed). Using this model, our system predicts the traffic conditions of a future time (when the computed route is actually driven) and performs a self-adaptive driving direction service for a particular user. This service gradually learns a user's driving behavior from the user's GPS logs and customizes the fastest route for the user with the help of the Cloud. We evaluate our service using a real-world dataset generated by over 33,000 taxis over a period of 3 months in Beijing. As a result, our service accurately estimates the travel time of a route for a user; hence finding the fastest route customized for the user.

【Keywords】: cyber-physical; driving directions; traffic prediction; trajectory

41. Interactive learning for efficiently detecting errors in insurance claims.

【Paper Link】【Pages】:325-333

【Authors】: Rayid Ghani ; Mohit Kumar

【Abstract】: Many practical data mining systems such as those for fraud detection and surveillance deal with building classifiers that are not autonomous but part of a larger interactive system with an expert in the loop. The goal of these systems is not just to maximize the performance of the classifier but to make the experts more efficient at performing their task, thus maximizing the overall Return on Investment of the system. This paper describes an interactive system for detecting payment errors in insurance claims with claim auditors in the loop. We describe an interactive claims prioritization component that uses an online cost-sensitive learning approach (more-like-this) to make the system efficient. Our interactive prioritization component is built on top of a batch classifier that has been trained to detect payment errors in health insurance claims and optimizes the interaction between the classifier and the domain experts who are consuming the results of this system. The goal is to make these auditors more efficient and effective as well as improving the classification performance of the system. The result is both a reduction in time it takes for the auditors to review and label claims as well as improving the precision of the system in finding payment errors. We show results obtained from applying this system at two major US health insurance companies indicating significant reduction in claim audit costs and potential savings of $20-$26 million/year making the insurance providers more efficient and lowering their operating costs. Our system reduces the money being wasted by providers and insurers dealing with incorrectly processed claims and makes the healthcare system more efficient.

【Keywords】: budgeted learning; cost sensitive learning; error detection; health insurance claims; interactive machine learning; quality control

42. NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce.

【Paper Link】【Pages】:334-342

【Authors】: Amol Ghoting ; Prabhanjan Kambadur ; Edwin P. D. Pednault ; Ramakrishnan Kannan

【Abstract】: In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for implementing these ML-DM algorithms. The MR programming paradigm has been of particular interest as it gracefully handles large datasets and has built-in resilience against failures. However, the existing parallel programming paradigms are too low-level and ill-suited for implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure that has been specifically designed to enable the rapid implementation of parallel ML-DM algorithms. The infrastructure allows one to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks that can be efficiently executed using MR and other parallel programming models; it currently runs on top of Hadoop, which is an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.

【Keywords】: data mining; machine learning; map/reduce; parallelism

43. Classification of proxy labeled examples for marketing segment generation.

【Paper Link】【Pages】:343-350

【Authors】: Dean Cerrato ; Rosie Jones ; Avinash Gupta

【Abstract】: Marketers often rely on a set of descriptive segments, or qualitative subsets of the population, to specify the audiences of targeted advertising campaigns. For example, the descriptive segment "Empty Nesters" might describe a desirable target audience for extended vacation package offers. While some segments may be easily described and generated using demographic data as ground truth, others such as "Soccer Moms" or "Urban Hipsters" reflect a combination of demographic and behavioral attributes. Ideally, these attributes would be available as the basis for ground truth labeling of a classifier training set or even direct member selection from the population. Unfortunately, ground truth attributes are often scarce or unavailable, in which case a proxy labeling scheme is needed. We devise a method for labeling a population according to criteria based on a postulated set of shopping behaviors specific to a descriptive segment. We then perform supervised binary classification on this labeled dataset in order to discover additional identifying patterns of behavior typical of labeled positives in the population. Finally, the resulting classifier is used to perform selection from the population into the segment, extending reach to cookies who may not have exhibited the postulated behaviors but likely belong in the segment. We validate our approach by comparing a descriptive segment trained on ground truth to one trained on behavioral attributes only. We show that our behavior-based approach produces classifiers having performance comparable to that of a classifier trained on the ground truth data.

【Keywords】: classification; computational advertising; marketing segments; ripper; rules

44. Ameliorating buyer's remorse.

【Paper Link】【Pages】:351-359

【Authors】: Rakesh Agrawal ; Samuel Ieong ; Raja Velu

【Abstract】: Keeping in pace with the increasing importance of commerce conducted over the Web, several e-commerce websites now provide admirable facilities for helping consumers decide what product to buy and where to buy it. However, since the prices of durable and high-tech products generally fall over time, a buyer of such products is often faced with a dilemma: Should she buy the product now or wait for cheaper prices? We present the design and implementation of Prodcast, an experimental system whose goal is to help consumers decide when to buy a product. The system makes use of forecasts of future prices based on price histories of the products, incorporating features such as sales volume, seasonality, and competition in making its recommendation. We describe techniques that are well-suited for this task and present a comprehensive evaluation of their relative merits using retail sales data for electronic products. Our back-testing of the system indicates that the system is capable of helping consumers time their purchase, resulting in significant savings to them.

【Keywords】: forecasting; prodcast; recommendation

Emerging 2 5

45. Experiences with mining temporal event sequences from electronic medical records: initial successes and some challenges.

【Paper Link】【Pages】:360-368

【Authors】: Debprakash Patnaik ; Patrick Butler ; Naren Ramakrishnan ; Laxmi Parida ; Benjamin J. Keller ; David A. Hanauer

【Abstract】: The standardization and wider use of electronic medical records (EMR) creates opportunities for better understanding patterns of illness and care within and across medical systems. Our interest is in the temporal history of event codes embedded in patients' records, specifically investigating frequently occurring sequences of event codes across patients. In studying data from more than 1.6 million patient histories at the University of Michigan Health system we quickly realized that frequent sequences, while providing one level of data reduction, still constitute a serious analytical challenge as many involve alternate serializations of the same sets of codes. To further analyze these sequences, we designed an approach where a partial order is mined from frequent sequences of codes. We demonstrate an EMR mining system called EMRView that enables exploration of the precedence relationships to quickly identify and visualize partial order information encoded in key classes of patients. We demonstrate some important nuggets learned through our approach and also outline key challenges for future research based on our experiences.

【Keywords】: medical informatics; partial orders; temporal data mining

46. Understanding atrophy trajectories in alzheimer's disease using association rules on MRI images.

【Paper Link】【Pages】:369-376

【Authors】: György J. Simon ; Peter W. Li ; Clifford R. Jack Jr. ; Prashanthi Vemuri

【Abstract】: Alzheimer's disease (AD) is associated with progressive cognitive decline leading to dementia. The atrophy/loss of brain structure as seen on Magnetic Resonance Imaging (MRI) is strongly correlated with the severity of the cognitive impairment in AD. In this paper, we set out to find associations between predefined regions of the brain (regions of interest; ROIs) and the severity of the disease. Specifically, we use these associations to address two important issues in AD: (i) typical versus atypical atrophy patterns and (ii) the origin and direction of progression of atrophy, which is currently under debate. We observed that each AD-related ROI is associated with a wide range of severity and that the difference between ROIs is merely a difference in severity distribution. To model differences between the severity distribution of a subpopulation (with significant atrophy in certain ROIs) and the severity distribution of the entire population, we developed the concept of Distributional Association Rules. Using the Distributional Association Rules, we clustered ROIs into disease subsystems. We define a disease subsystem as a contiguous set of ROIs that are collectively implicated in AD. AD is known to be heterogeneous in the sense that multiple sets of ROIs may be related to the disease in a population. We proposed an enhancement to the association rule mining where the algorithm only discovers association rules with ROIs that form an approximately contiguous volume. Next, we applied these association rules to infer the direction of disease progression based on the support measures of the association rules. We also developed a novel statistical test to determine the statistical significance of the discovered direction. We evaluated the proposed method on the Mayo Clinic Alzheimer's Disease Research Center (ADRC) prospective patient cohorts. The key achievements of the methodology is that it accurately identified larger disease subsystems implicated in typical and atypical AD and it successfully mapped the directions of disease progression. The wealth of data available in Radiology gives rise to opportunities for applying this methodology to map out the trajectory of several other diseases, e.g. other neuro-degenerative diseases and cancers, most notably, breast cancer. The applicability of this method is not limited to image data, as associating predictors with severity provides valuable information in most areas of medicine as well as other industries.

【Keywords】: continuous outcome; disease progression; mri; spatial association analysis

47. A case study in a recommender system based on purchase data.

【Paper Link】【Pages】:377-385

【Authors】: Bruno Pradel ; Savaneary Sean ; Julien Delporte ; Sébastien Guérif ; Céline Rouveirol ; Nicolas Usunier ; Françoise Fogelman-Soulié ; Frédéric Dufau-Joël

【Abstract】: Collaborative filtering has been extensively studied in the context of ratings prediction. However, industrial recommender systems often aim at predicting a few items of immediate interest to the user, typically products that (s)he is likely to buy in the near future. In a collaborative filtering setting, the prediction may be based on the user's purchase history rather than rating information, which may be unreliable or unavailable. In this paper, we present an experimental evaluation of various collaborative filtering algorithms on a real-world dataset of purchase history from customers in a store of a French home improvement and building supplies chain. These experiments are part of the development of a prototype recommender system for salespeople in the store. We show how different settings for training and applying the models, as well as the introduction of domain knowledge may dramatically influence both the absolute and the relative performances of the different algorithms. To the best of our knowledge, the influence of these parameters on the quality of the predictions of recommender systems has rarely been reported in the literature.

【Keywords】: collaborative filtering; recommender systems

48. Detecting bots via incremental LS-SVM learning with dynamic feature adaptation.

【Paper Link】【Pages】:386-394

【Authors】: Feilong Chen ; Supranamaya Ranjan ; Pang-Ning Tan

【Abstract】: As botnets continue to proliferate and grow in sophistication, so does the need for more advanced security solutions to effectively detect and defend against such attacks. In particular, botnets such as Conficker have been known to encrypt the communication packets exchanged between bots and their command-and-control server, making it costly for existing botnet detection systems that rely on deep packet inspection (DPI) methods to identify compromised machines. In this paper, we argue that, even in the face of encrypted traffic flows, botnets can still be detected by examining the set of server IP-addresses visited by a client machine in the past. However there are several challenges that must be addressed. First, the set of server IP-addresses visited by client machines may evolve dynamically. Second, the set of client machines used for training and their class labels may also change over time. To overcome these challenges, this paper presents a novel incremental LS-SVM algorithm that is adaptive to both changes in the feature set and class labels of training instances. To evaluate the performance of our algorithm, we have performed experiments on two large-scale datasets, including real-time data collected from peering routers at a large Tier-1 ISP. Experimental results showed that the proposed algorithm produces classification accuracy comparable to its batch counterpart, while consuming significantly less computational resources.

【Keywords】: botnet detection; online learning; support vector machines

49. Toward personalized care management of patients at risk: the diabetes case study.

【Paper Link】【Pages】:395-403

【Authors】: Hani Neuvirth ; Michal Ozery-Flato ; Jianying Hu ; Jonathan Laserson ; Martin S. Kohn ; Shahram Ebadollahi ; Michal Rosen-Zvi

【Abstract】: Chronic diseases constitute the leading cause of mortality in the western world, have a major impact on the patients' quality of life, and comprise the bulk of healthcare costs. Nowadays, healthcare data management systems integrate large amounts of medical information on patients, including diagnoses, medical procedures, lab test results, and more. Sophisticated analysis methods are needed for utilizing these data to assist in patient management and to enhance treatment quality at reduced costs. In this study, we take a first step towards better disease management of diabetic patients by applying state-of-the art methods to anticipate the patient's future health condition and to identify patients at high risk. Two relevant outcome measures are explored: the need for emergency care services and the probability of the treatment producing a sub-optimal result, as defined by domain experts. By identifying the high-risk patients our prediction system can be used by healthcare providers to prepare both financially and logistically for the patient needs. To demonstrate a potential downstream application for the identified high-risk patients, we explore the association between the physician treating these patients and the treatment outcome, and propose a system that can assist healthcare providers in optimizing the match between a patient and a physician. Our work formulates the problem and examines the performance of several learning models on data from several thousands of patients. We further describe a pilot system built on the results of this analysis. We show that the risk for the two considered outcomes can be evaluated from patients' characteristics and that features of the patient-physician match improve the prediction accuracy for the treatment's success. These results suggest that personalized medicine can be valuable for high risk patients and raise interesting questions for future improvements.

【Keywords】: machine learning; survival analysis

Emerging 3 4

50. Matching unstructured product offers to structured product specifications.

【Paper Link】【Pages】:404-412

【Authors】: Anitha Kannan ; Inmar E. Givoni ; Rakesh Agrawal ; Ariel Fuxman

【Abstract】: An e-commerce catalog typically comprises of specifications for millions of products. The search engine receives millions of sales offers from thousands of independent merchants that must be matched to the right products. We describe the challenges that a system for matching unstructured offers to structured product descriptions must address, drawing upon our experience from building such a system for Bing Shopping. The heart of our system is a data-driven component that learns the matching function off-line, which is then applied at run-time for matching offers to products. We provide the design of this and other critical components of the system as well as the details of the extensive experiments we performed to assess the readiness of the system. This system is currently deployed in an experimental Commerce Search Engine and is used to match all the offers received by Bing Shopping to the Bing product catalog.

【Keywords】: commerce search; matching; structured data; unstructured data

51. Predictive client-side profiles for personalized advertising.

【Paper Link】【Pages】:413-421

【Authors】: Mikhail Bilenko ; Matthew Richardson

【Abstract】: Personalization is ubiquitous in modern online applications as it provides significant improvements in user experience by adapting it to inferred user preferences. However, there are increasing concerns related to issues of privacy and control of the user data that is aggregated by online systems to power personalized experiences. These concerns are particularly significant for user profile aggregation in online advertising. This paper describes a practical, learning-driven client-side personalization approach for keyword advertising platforms, an emerging application previously not addressed in literature. Our approach relies on storing user-specific information entirely within the user's control (in a browser cookie or browser local storage), thus allowing the user to view, edit or purge it at any time (e.g., via a dedicated webpage). We develop a principled, utility-based formulation for the problem of iteratively updating user profiles stored client-side, which relies on calibrated prediction of future user activity. While optimal profile construction is NP-hard for pay-per-click advertising with bid increments, it can be efficiently solved via a greedy approximation algorithm guaranteed to provide a near-optimal solution due to the fact that keyword profile utility is submodular: it exhibits the property of diminishing returns with increasing profile size. We empirically evaluate client-side keyword profiles for keyword advertising on a large-scale dataset from a major search engine. Experiments demonstrate that predictive client-side personalization allows ad platforms to retain almost all of the revenue gains from personalization even if they give users the freedom to opt out of behavior tracking backed by server-side storage. Additionally, we show that advertisers can potentially increase their return on investment significantly by utilizing bid increments for keyword profiles in their ad campaigns.

【Keywords】: client-side personalization; online advertising

52. Smoothing techniques for adaptive online language models: topic tracking in tweet streams.

【Paper Link】【Pages】:422-429

【Authors】: Jimmy Lin ; Rion Snow ; William Morgan

【Abstract】: We are interested in the problem of tracking broad topics such as "baseball" and "fashion" in continuous streams of short texts, exemplified by tweets from the microblogging service Twitter. The task is conceived as a language modeling problem where per-topic models are trained using hashtags in the tweet stream, which serve as proxies for topic labels. Simple perplexity-based classifiers are then applied to filter the tweet stream for topics of interest. Within this framework, we evaluate, both intrinsically and extrinsically, smoothing techniques for integrating "foreground" models (to capture recency) and "background" models (to combat sparsity), as well as different techniques for retaining history. Experiments show that unigram language models smoothed using a normalized extension of stupid backoff and a simple queue for history retention performs well on the task.

【Keywords】: stream processing; tdt; twitter

53. Democrats, republicans and starbucks afficionados: user classification in twitter.

【Paper Link】【Pages】:430-438

【Authors】: Marco Pennacchiotti ; Ana-Maria Popescu

【Abstract】: More and more technologies are taking advantage of the explosion of social media (Web search, content recommendation services, marketing, ad targeting, etc.). This paper focuses on the problem of automatically constructing user profiles, which can significantly benefit such technologies. We describe a general and robust machine learning framework for large-scale classification of social media users according to dimensions of interest. We report encouraging experimental results on 3 tasks with different characteristics: political affiliation detection, ethnicity identification and detecting affinity for a particular business.

【Keywords】: machine learning; microblogging; social media; user profiling

Text mining 1 3

54. Beyond keyword search: discovering relevant scientific literature.

【Paper Link】【Pages】:439-447

【Authors】: Khalid El-Arini ; Carlos Guestrin

【Abstract】: In scientific research, it is often difficult to express information needs as simple keyword queries. We present a more natural way of searching for relevant scientific literature. Rather than a string of keywords, we define a query as a small set of papers deemed relevant to the research task at hand. By optimizing an objective function based on a fine-grained notion of influence between documents, our approach efficiently selects a set of highly relevant articles. Moreover, as scientists trust some authors more than others, results are personalized to individual preferences. In a user study, researchers found the papers recommended by our method to be more useful, trustworthy and diverse than those selected by popular alternatives, such as Google Scholar and a state-of-the-art topic modeling approach.

【Keywords】: citation analysis; personalization

55. Collaborative topic modeling for recommending scientific articles.

【Paper Link】【Pages】:448-456

【Authors】: Chong Wang ; David M. Blei

【Abstract】: Researchers have access to large online archives of scientific articles. As a consequence, finding relevant papers has become more difficult. Newly formed online communities of researchers sharing citations provides a new way to solve this problem. In this paper, we develop an algorithm to recommend scientific articles to users of an online community. Our approach combines the merits of traditional collaborative filtering and probabilistic topic modeling. It provides an interpretable latent structure for users and items, and can form recommendations about both existing and newly published articles. We study a large subset of data from CiteULike, a bibliography sharing service, and show that our algorithm provides a more effective recommender system than traditional collaborative filtering.

【Keywords】: collaborative filtering; latent structure interpretation; scientific article recommendation; topic modeling

56. Partially labeled topic models for interpretable text mining.

【Paper Link】【Pages】:457-465

【Authors】: Daniel Ramage ; Christopher D. Manning ; Susan T. Dumais

【Abstract】: Abstract Much of the world's electronic text is annotated with human-interpretable labels, such as tags on web pages and subject codes on academic publications. Effective text mining in this setting requires models that can flexibly account for the textual patterns that underlie the observed labels while still discovering unlabeled topics. Neither supervised classification, with its focus on label prediction, nor purely unsupervised learning, which does not model the labels explicitly, is appropriate. In this paper, we present two new partially supervised generative models of labeled text, Partially Labeled Dirichlet Allocation (PLDA) and the Partially Labeled Dirichlet Process (PLDP). These models make use of the unsupervised learning machinery of topic models to discover the hidden topics within each label, as well as unlabeled, corpus-wide latent topics. We explore applications with qualitative case studies of tagged web pages from del.icio.us and PhD dissertation abstracts, demonstrating improved model interpretability over traditional topic models. We use the many tags present in our del.icio.us dataset to quantitatively demonstrate the new models' higher correlation with human relatedness scores over several strong baselines.

【Keywords】: partially supervised learning; text mining

Text mining 2 3

57. Refining causality: who copied from whom?

【Paper Link】【Pages】:466-474

【Authors】: Tristan Mark Snowsill ; Nick Fyson ; Tijl De Bie ; Nello Cristianini

【Abstract】: Inferring causal networks behind observed data is an active area of research with wide applicability to areas such as epidemiology, microbiology and social science. In particular recent research has focused on identifying how information propagates through the Internet. This research has so far only used temporal features of observations, and while reasonable results have been achieved, there is often further information which can be used. In this paper we show that additional features of the observed data can be used very effectively to improve an existing method. Our particular example is one of inferring an underlying network for how text is reused in the Internet, although the general approach is applicable to other inference methods and information sources. We develop a method to identify how a piece of text evolves as it moves through an underlying network and how substring information can be used to narrow down where in the evolutionary process a particular observation at a node lies. Hence we narrow down the number of ways the node could have acquired the infection. Text reuse is detected using a suffix tree which is also used to identify the substring relations between chunks of reused text. We then use a modification of the NetCover method to infer the underlying network. Experimental results -- on both synthetic and real life data -- show that using more information than just timing leads to greater accuracy in the inferred networks.

【Keywords】: network inference; network reconstruction; suffix tree

58. Conditional topical coding: an efficient topic model conditioned on rich features.

【Paper Link】【Pages】:475-483

【Authors】: Jun Zhu ; Ni Lao ; Ning Chen ; Eric P. Xing

【Abstract】: Probabilistic topic models have shown remarkable success in many application domains. However, a probabilistic conditional topic model can be extremely inefficient when considering a rich set of features because it needs to define a normalized distribution, which usually involves a hard-to-compute partition function. This paper presents conditional topical coding (CTC), a novel formulation of conditional topic models which is non-probabilistic. CTC relaxes the normalization constraints as in probabilistic models and learns non-negative document codes and word codes. CTC does not need to define a normalized distribution and can efficiently incorporate a rich set of features for improved topic discovery and prediction tasks. Moreover, CTC can directly control the sparsity of inferred representations by using appropriate regularization. We develop an efficient and easy-to-implement coordinate descent learning algorithm, of which each coding substep has a closed-form solution. Finally, we demonstrate the advantages of CTC on online review analysis datasets. Our results show that conditional topical coding can achieve state-of-the-art prediction performance and is much more efficient in training (one order of magnitude faster) and testing (two orders of magnitude faster) than probabilistic conditional topic models.

【Keywords】: conditional models; sparse coding; topic models

59. Tracking trends: incorporating term volume into temporal topic models.

【Paper Link】【Pages】:484-492

【Authors】: Liangjie Hong ; Dawei Yin ; Jian Guo ; Brian D. Davison

【Abstract】: Text corpora with documents from a range of time epochs are natural and ubiquitous in many fields, such as research papers, newspaper articles and a variety of types of recently emerged social media. People not only would like to know what kind of topics can be found from these data sources but also wish to understand the temporal dynamics of these topics and predict certain properties of terms or documents in the future. Topic models are usually utilized to find latent topics from text collections, and recently have been applied to temporal text corpora. However, most proposed models are general purpose models to which no real tasks are explicitly associated. Therefore, current models may be difficult to apply in real-world applications, such as the problems of tracking trends and predicting popularity of keywords. In this paper, we introduce a real-world task, tracking trends of terms, to which temporal topic models can be applied. Rather than building a general-purpose model, we propose a new type of topic model that incorporates the volume of terms into the temporal dynamics of topics and optimizes estimates of term volumes. In existing models, trends are either latent variables or not considered at all which limits the potential for practical use of trend information. In contrast, we combine state-space models with term volumes with a supervised learning model, enabling us to effectively predict the volume in the future, even without new documents. In addition, it is straightforward to obtain the volume of latent topics as a by-product of our model, demonstrating the superiority of utilizing temporal topic models over traditional time-series tools (e.g., autoregressive models) to tackle this kind of problem. The proposed model can be further extended with arbitrary word-level features which are evolving over time. We present the results of applying the model to two datasets with long time periods and show its effectiveness over non-trivial baselines.

【Keywords】: temporal dynamics; text mining; topic models

Privacy 3

60. Differentially private data release for data mining.

【Paper Link】【Pages】:493-501

【Authors】: Noman Mohammed ; Rui Chen ; Benjamin C. M. Fung ; Philip S. Yu

【Abstract】: Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among the existing privacy models, ∈-differential privacy provides one of the strongest privacy guarantees and has no assumptions about an adversary's background knowledge. Most of the existing solutions that ensure ∈-differential privacy are based on an interactive model, where the data miner is only allowed to pose aggregate queries to the database. In this paper, we propose the first anonymization algorithm for the non-interactive setting based on the generalization technique. The proposed solution first probabilistically generalizes the raw data and then adds noise to guarantee ∈-differential privacy. As a sample application, we show that the anonymized data can be used effectively to build a decision tree induction classifier. Experimental results demonstrate that the proposed non-interactive anonymization algorithm is scalable and performs better than the existing solutions for classification analysis.

【Keywords】: anonymization; data mining; differential privacy

61. k-NN as an implementation of situation testing for discrimination discovery and prevention.

【Paper Link】【Pages】:502-510

【Authors】: Binh Luong Thanh ; Salvatore Ruggieri ; Franco Turini

【Abstract】: With the support of the legally-grounded methodology of situation testing, we tackle the problems of discrimination discovery and prevention from a dataset of historical decisions by adopting a variant of k-NN classification. A tuple is labeled as discriminated if we can observe a significant difference of treatment among its neighbors belonging to a protected-by-law group and its neighbors not belonging to it. Discrimination discovery boils down to extracting a classification model from the labeled tuples. Discrimination prevention is tackled by changing the decision value for tuples labeled as discriminated before training a classifier. The approach of this paper overcomes legal weaknesses and technical limitations of existing proposals.

【Keywords】: discrimination discovery and prevention; k-nn classification

【Paper Link】【Pages】:511-519

【Authors】: Pritam Gundecha ; Geoffrey Barbier ; Huan Liu

【Abstract】: As (one's) social network expands, a user's privacy protection goes beyond his privacy settings and becomes a social networking problem. In this research, we aim to address some critical issues related to privacy protection: Would the highest privacy settings guarantee a secure protection? Given the open nature of social networking sites, is it possible to manage one's privacy protection? With the diversity of one's social media friends, how can one figure out an effective approach to balance between vulnerability and privacy? We present a novel way to define a vulnerable friend from an individual user's perspective is dependent on whether or not the user's friends' privacy settings protect the friend and the individual's network of friends (which includes the user). As a single vulnerable friend in a user's social network might place all friends at risk, we resort to experiments and observe how much security an individual user can improve by unfriending a vulnerable friend. We also show how privacy weakens if newly accepted friends are unguarded or unprotected. This work provides a large-scale evaluation of new security and privacy indexes using a Facebook dataset. We present and discuss a new perspective for reasoning about social networking security. When a user accepts a new friend, the user should ensure that the new friend is not an increased security risk with the potential of negatively impacting the entire friend network. Additionally, by leveraging the indexes proposed and employing new strategies for unfriending vulnerable friends, it is possible to further improve security and privacy without changing the social networking site's existing architecture.

【Keywords】: privacy; social network; vulnerability

【Paper Link】【Pages】:520-528

【Authors】: Mao Ye ; Dong Shou ; Wang-Chien Lee ; Peifeng Yin ; Krzysztof Janowicz

【Abstract】: In this paper, we develop a semantic annotation technique for location-based social networks to automatically annotate all places with category tags which are a crucial prerequisite for location search, recommendation services, or data cleaning. Our annotation algorithm learns a binary support vector machine (SVM) classifier for each tag in the tag space to support multi-label classification. Based on the check-in behavior of users, we extract features of places from i) explicit patterns (EP) of individual places and ii) implicit relatedness (IR) among similar places. The features extracted from EP are summarized from all check-ins at a specific place. The features from IR are derived by building a novel network of related places (NRP) where similar places are linked by virtual edges. Upon NRP, we determine the probability of a category tag for each place by exploring the relatedness of places. Finally, we conduct a comprehensive experimental study based on a real dataset collected from a location-based social network, Whrrl. The results demonstrate the suitability of our approach and show the strength of taking both EP and IR into account in feature extraction.

【Keywords】: location-based social networks; points of interest; semantic annotation; user behavior

64. Sparsification of influence networks.

【Paper Link】【Pages】:529-537

【Authors】: Michael Mathioudakis ; Francesco Bonchi ; Carlos Castillo ; Aristides Gionis ; Antti Ukkonen

【Abstract】: We present Spine, an efficient algorithm for finding the "backbone" of an influence network. Given a social graph and a log of past propagations, we build an instance of the independent-cascade model that describes the propagations. We aim at reducing the complexity of that model, while preserving most of its accuracy in describing the data. We show that the problem is inapproximable and we present an optimal, dynamic-programming algorithm, whose search space, albeit exponential, is typically much smaller than that of the brute force, exhaustive-search approach. Seeking a practical, scalable approach to sparsification, we devise Spine, a greedy, efficient algorithm with practically little compromise in quality. We claim that sparsification is a fundamental data-reduction operation with many applications, ranging from visualization to exploratory and descriptive data analysis. As a proof of concept, we use Spine on real-world datasets, revealing the backbone of their influence-propagation networks. Moreover, we apply Spine as a pre-processing step for the influence-maximization problem, showing that computations on sparsified models give up little accuracy, but yield significant improvements in terms of scalability.

【Keywords】: influence; propagation; social networks

65. Leveraging collaborative tagging for web item design.

【Paper Link】【Pages】:538-546

【Authors】: Mahashweta Das ; Gautam Das ; Vagelis Hristidis

【Abstract】: The popularity of collaborative tagging sites has created new challenges and opportunities for designers of web items, such as electronics products, travel itineraries, popular blogs, etc. An increasing number of people are turning to online reviews and user-specified tags to choose from among competing items. This creates an opportunity for designers to build items that are likely to attract desirable tags when published. In this paper, we consider a novel optimization problem: given a training dataset of existing items with their user-submitted tags, and a query set of desirable tags, design the k best new items expected to attract the maximum number of desirable tags. We show that this problem is NP-Complete, even if simple Naive Bayes Classifiers are used for tag prediction. We present two principled algorithms for solving this problem: (a) an exact "two-tier" algorithm (based on top-k querying techniques), which performs much better than the naive brute-force algorithm and works well for moderate problem instances, and (b) a novel polynomial-time approximation algorithm with provable error bound for larger problem instances. We conduct detailed experiments on synthetic and real data crawled from the web to evaluate the efficiency and quality of our proposed algorithms.

【Keywords】: collaborative tagging; item design; naive bayes; optimization

Theory 3

66. Stackelberg games for adversarial prediction problems.

【Paper Link】【Pages】:547-555

【Authors】: Michael Brückner ; Tobias Scheffer

【Abstract】: The standard assumption of identically distributed training and test data is violated when test data are generated in response to a predictive model. This becomes apparent, for example, in the context of email spam filtering, where an email service provider employs a spam filter and the spam sender can take this filter into account when generating new emails. We model the interaction between learner and data generator as a Stackelberg competition in which the learner plays the role of the leader and the data generator may react on the leader's move. We derive an optimization problem to determine the solution of this game and present several instances of the Stackelberg prediction game. We show that the Stackelberg prediction game generalizes existing prediction models. Finally, we explore properties of the discussed models empirically in the context of email spam filtering.

【Keywords】: adversarial classification; prediction game; spam filtering; stackelberg competition

67. Leakage in data mining: formulation, detection, and avoidance.

【Paper Link】【Pages】:556-563

【Authors】: Shachar Kaufman ; Saharon Rosset ; Claudia Perlich

【Abstract】: Deemed "one of the top ten data mining mistakes", leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical i.i.d. assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected.

【Keywords】: data mining; leakage; predictive modeling; statistical inference

68. An information theoretic framework for data mining.

【Paper Link】【Pages】:564-572

【Authors】: Tijl De Bie

【Abstract】: We formalize the data mining process as a process of information exchange, defined by the following key components. The data miner's state of mind is modeled as a probability distribution, called the background distribution, which represents the uncertainty and misconceptions the data miner has about the data. This model initially incorporates any prior (possibly incorrect) beliefs a data miner has about the data. During the data mining process, properties of the data (to which we refer as patterns) are revealed to the data miner, either in batch, one by one, or even interactively. This acquisition of information in the data mining process is formalized by updates to the background distribution to account for the presence of the found patterns. The proposed framework can be motivated using concepts from information theory and game theory. Understanding it from this perspective, it is easy to see how it can be extended to more sophisticated settings, e.g. where patterns are probabilistic functions of the data (thus allowing one to account for noise and errors in the data mining process, and allowing one to study data mining techniques based on subsampling the data). The framework then models the data mining process using concepts from information geometry, and I-projections in particular. The framework can be used to help in designing new data mining algorithms that maximize the efficiency of the information exchange from the algorithm to the data miner.

【Keywords】: data mining framework; maxent; subjective interestingness

Frequent sets 3

69. Tell me what i need to know: succinctly summarizing data with itemsets.

【Paper Link】【Pages】:573-581

【Authors】: Michael Mampaey ; Nikolaj Tatti ; Jilles Vreeken

【Abstract】: Data analysis is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and hence, what result we would find the most interesting. With this in mind, we introduce a well-founded approach for succinctly summarizing data with a collection of itemsets; using a probabilistic maximum entropy model, we iteratively find the most interesting itemset, and in turn update our model of the data accordingly. As we only include itemsets that are surprising with regard to the current model, the summary is guaranteed to be both descriptive and non-redundant. The algorithm that we present can either mine the top-k most interesting itemsets, or use the Bayesian Information Criterion to automatically identify the model containing only the itemsets most important for describing the data. Or, in other words, it will 'tell you what you need to know'. Experiments on synthetic and benchmark data show that the discovered summaries are succinct, and correctly identify the key patterns in the data. The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet non-redundant itemsets.

【Keywords】: itemsets; maximum entropy; summarization

70. Direct local pattern sampling by efficient two-step random procedures.

【Paper Link】【Pages】:582-590

【Authors】: Mario Boley ; Claudio Lucchese ; Daniel Paurat ; Thomas Gärtner

【Abstract】: We present several exact and highly scalable local pattern sampling algorithms. They can be used as an alternative to exhaustive local pattern discovery methods (e.g, frequent set mining or optimistic-estimator-based subgroup discovery) and can substantially improve efficiency as well as controllability of pattern discovery processes. While previous sampling approaches mainly rely on the Markov chain Monte Carlo method, our procedures are direct, i.e., non process-simulating, sampling algorithms. The advantages of these direct methods are an almost optimal time complexity per pattern as well as an exactly controlled distribution of the produced patterns. Namely, the proposed algorithms can sample (item-)sets according to frequency, area, squared frequency, and a class discriminativity measure. Experiments demonstrate that these procedures can improve the accuracy of pattern-based models similar to frequent sets and often also lead to substantial gains in terms of scalability.

【Keywords】: frequent sets; local pattern discovery; pattern-based classification; sampling

71. Mining frequent closed graphs on evolving data streams.

【Paper Link】【Pages】:591-599

【Authors】: Albert Bifet ; Geoff Holmes ; Bernhard Pfahringer ; Ricard Gavaldà

【Abstract】: Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this paper we present a framework for studying graph pattern mining on time-varying streams. Three new methods for mining frequent closed subgraphs are presented. All methods work on coresets of closed subgraphs, compressed representations of graph sets, and maintain these sets in a batch-incremental manner, but use different approaches to address potential concept drift. An evaluation study on datasets comprising up to four million graphs explores the strength and limitations of the proposed methods. To the best of our knowledge this is the first work on mining frequent closed subgraphs in non-stationary data streams.

【Keywords】: closed mining; concept drift; data streams; graphs

Text mining 3 3

72. Latent topic feedback for information retrieval.

【Paper Link】【Pages】:600-608

【Authors】: David Andrzejewski ; David Buttler

【Abstract】: We consider the problem of a user navigating an unfamiliar corpus of text documents where document metadata is limited or unavailable, the domain is specialized, and the user base is small. These challenging conditions may hold, for example, within an organization such as a business or government agency. We propose to augment standard keyword search with user feedback on latent topics. These topics are automatically learned from the corpus in an unsupervised manner and presented alongside search results. User feedback is then used to reformulate the original query, resulting in improved information retrieval performance in our experiments.

【Keywords】: latent topic models; user feedback

73. Localized factor models for multi-context recommendation.

【Paper Link】【Pages】:609-617

【Authors】: Deepak Agarwal ; Bee-Chung Chen ; Bo Long

【Abstract】: Combining correlated information from multiple contexts can significantly improve predictive accuracy in recommender problems. Such information from multiple contexts is often available in the form of several incomplete matrices spanning a set of entities like users, items, features, and so on. Existing methods simultaneously factorize these matrices by sharing a single set of factors for entities across all contexts. We show that such a strategy may introduce significant bias in estimates and propose a new model that ameliorates this issue by positing local, context-specific factors for entities. To avoid over-fitting in contexts with sparse data, the local factors are connected through a shared global model. This sharing of parameters allows information to flow across contexts through multivariate regressions among local factors, instead of enforcing exactly the same factors for an entity, everywhere. Model fitting is done in an EM framework, we show that the E-step can be fitted through a fast multi-resolution Kalman filter algorithm that ensures scalability. Experiments on benchmark and real-world Yahoo! datasets clearly illustrate the usefulness of our approach. Our model significantly improves predictive accuracy, especially in cold-start scenarios.

【Keywords】: data fusion; epinions; kalman filter; meta-analysis; recommender systems

74. Latent aspect rating analysis without aspect keyword supervision.

【Paper Link】【Pages】:618-626

【Authors】: Hongning Wang ; Yue Lu ; ChengXiang Zhai

【Abstract】: Mining detailed opinions buried in the vast amount of review text data is an important, yet quite challenging task with widespread applications in multiple domains. Latent Aspect Rating Analysis (LARA) refers to the task of inferring both opinion ratings on topical aspects (e.g., location, service of a hotel) and the relative weights reviewers have placed on each aspect based on review content and the associated overall ratings. A major limitation of previous work on LARA is the assumption of pre-specified aspects by keywords. However, the aspect information is not always available, and it may be difficult to pre-define appropriate aspects without a good knowledge about what aspects are actually commented on in the reviews. In this paper, we propose a unified generative model for LARA, which does not need pre-specified aspect keywords and simultaneously mines 1) latent topical aspects, 2) ratings on each identified aspect, and 3) weights placed on different aspects by a reviewer. Experiment results on two different review data sets demonstrate that the proposed model can effectively perform the Latent Aspect Rating Analysis task without the supervision of aspect keywords. Because of its generality, the proposed model can be applied to explore all kinds of opinionated text data containing overall sentiment judgments and support a wide range of interesting application tasks, such as aspect-based opinion summarization, personalized entity ranking and recommendation, and reviewer behavior analysis.

【Keywords】: aspect identification; latent rating analysis; review mining

Unsupervised learning 3

75. Density estimation trees.

【Paper Link】【Pages】:627-635

【Authors】: Parikshit Ram ; Alexander G. Gray

【Abstract】: In this paper we develop density estimation trees (DETs), the natural analog of classification trees and regression trees, for the task of density estimation. We consider the estimation of a joint probability density function of a d-dimensional random vector X and define a piecewise constant estimator structured as a decision tree. The integrated squared error is minimized to learn the tree. We show that the method is nonparametric: under standard conditions of nonparametric density estimation, DETs are shown to be asymptotically consistent. In addition, being decision trees, DETs perform automatic feature selection. They empirically exhibit the interpretability, adaptability and feature selection properties of supervised decision trees while incurring slight loss in accuracy over other nonparametric density estimators. Hence they might be able to avoid the curse of dimensionality if the true density is sparse in dimensions. We believe that density estimation trees provide a new tool for exploratory data analysis with unique capabilities.

【Keywords】: decision trees; density estimation

76. Unsupervised clustering of multidimensional distributions using earth mover distance.

【Paper Link】【Pages】:636-644

【Authors】: David Applegate ; Tamraparni Dasu ; Shankar Krishnan ; Simon Urbanek

【Abstract】: Multidimensional distributions are often used in data mining to describe and summarize different features of large datasets. It is natural to look for distinct classes in such datasets by clustering the data. A common approach entails the use of methods like k-means clustering. However, the k-means method inherently relies on the Euclidean metric in the embedded space and does not account for additional topology underlying the distribution. In this paper, we propose using Earth Mover Distance (EMD) to compare multidimensional distributions. For a n-bin histogram, the EMD is based on a solution to the transportation problem with time complexity O(n3 log n). To mitigate the high computational cost of EMD, we propose an approximation that reduces the cost to linear time. Given the large size of our dataset a fast approximation is crucial for this application. Other notions of distances such as the information theoretic Kullback-Leibler divergence and statistical χ2 distance, account only for the correspondence between bins with the same index, and do not use information across bins, and are sensitive to bin size. A cross-bin distance measure like EMD is not affected by binning differences and meaningfully matches the perceptual notion of "nearness". Our technique is simple, efficient and practical for clustering distributions. We demonstrate the use of EMD on a real-world application of analyzing 411,550 anonymous mobility usage patterns which are defined as distributions over a manifold. EMD allows us to represent inherent relationships in this space, and enables us to successfully cluster even sparse signatures.

【Keywords】: clustering; distributions; earth mover distance (emd); signatures

77. Online heterogeneous mixture modeling with marginal and copula selection.

【Paper Link】【Pages】:645-653

【Authors】: Ryohei Fujimaki ; Yasuhiro Sogawa ; Satoshi Morinaga

【Abstract】: This paper proposes an online mixture modeling methodology in which individual components can have different marginal distributions and dependency structures. Mixture models have been widely studied and applied to various application areas, including density estimation, fraud/failure detection, image segmentation, etc. Previous research has been almost exclusively focused on mixture models having components of a single type (e.g., a Gaussian mixture model.) However, recent growing needs for complicated data modeling necessitate the use of more flexible mixture models (e.g., a mixture of a lognormal distribution for medical costs and a Gaussian distribution for blood pressure, for medical analytics.) Our key ideas include: 1) separating marginal distributions and their dependencies using copulas and 2) online extension of a recently-developed "expectation minimization of description length," which enable us to efficiently learn types of both marginal distributions and copulas as well as their parameters. The proposed method provides not only good performance in applications, but also scalable, automatic model selection, which greatly reduces the intensive modeling costs in data mining processes. We show that the proposed method outperforms state-of-the-art methods in application to density estimation and to anomaly detection.

【Keywords】: copula; expectation minimization of description length; heterogeneous mixture model; online model selection

Graph mining 3

78. Dual active feature and sample selection for graph classification.

【Paper Link】【Pages】:654-662

【Authors】: Xiangnan Kong ; Wei Fan ; Philip S. Yu

【Abstract】: Graph classification has become an important and active research topic in the last decade. Current research on graph classification focuses on mining discriminative subgraph features under supervised settings. The basic assumption is that a large number of labeled graphs are available. However, labeling graph data is quite expensive and time consuming for many real-world applications. In order to reduce the labeling cost for graph data, we address the problem of how to select the most important graph to query for the label. This problem is challenging and different from conventional active learning problems because there is no predefined feature vector. Moreover, the subgraph enumeration problem is NP-hard. The active sample selection problem and the feature selection problem are correlated for graph data. Before we can solve the active sample selection problem, we need to find a set of optimal subgraph features. To address this challenge, we demonstrate how one can simultaneously estimate the usefulness of a query graph and a set of subgraph features. The idea is to maximize the dependency between subgraph features and graph labels using an active learning framework. We propose a branch-and-bound algorithm to search for the optimal query graph and optimal features simultaneously. Empirical studies on nine real-world tasks demonstrate that the proposed method can obtain better accuracy on graph data than alternative approaches.

【Keywords】: active learning; data mining; feature construction; feature selection; graph classification; subgraph pattern

79. It's who you know: graph mining using recursive structural features.

【Paper Link】【Pages】:663-671

【Authors】: Keith Henderson ; Brian Gallagher ; Lei Li ; Leman Akoglu ; Tina Eliassi-Rad ; Hanghang Tong ; Christos Faloutsos

【Abstract】: Given a graph, how can we extract good features for the nodes? For example, given two large graphs from the same domain, how can we use information in one to do classification in the other (i.e., perform across-network classification or transfer learning on graphs)? Also, if one of the graphs is anonymized, how can we use information in one to de-anonymize the other? The key step in all such graph mining tasks is to find effective node features. We propose ReFeX (Recursive Feature eXtraction), a novel algorithm, that recursively combines local (node-based) features with neighborhood (egonet-based) features; and outputs regional features -- capturing "behavioral" information. We demonstrate how these powerful regional features can be used in within-network and across-network classification and de-anonymization tasks -- without relying on homophily, or the availability of class labels. The contributions of our work are as follows: (a) ReFeX is scalable and (b) it is effective, capturing regional ("behavioral") information in large graphs. We report experiments on real graphs from various domains with over 1M edges, where ReFeX outperforms its competitors on typical graph mining tasks like network classification and de-anonymization.

【Keywords】: feature extraction; graph mining; identity resolution; network classification

80. Triangle listing in massive networks and its applications.

【Paper Link】【Pages】:672-680

【Authors】: Shumo Chu ; James Cheng

【Abstract】: Triangle listing is one of the fundamental algorithmic problems whose solution has numerous applications especially in the analysis of complex networks, such as the computation of clustering coefficient, transitivity, triangular connectivity, etc. Existing algorithms for triangle listing are mainly in-memory algorithms, whose performance cannot scale with the massive volume of today's fast growing networks. When the input graph cannot fit into main memory, triangle listing requires random disk accesses that can incur prohibitively large I/O cost. Some streaming and sampling algorithms have been proposed but these are approximation algorithms. We propose an I/O-efficient algorithm for triangle listing. Our algorithm is exact and avoids random disk access. Our results show that our algorithm is scalable and outperforms the state-of-the-art local triangle estimation algorithm.

【Keywords】: clustering coefficient; large graphs; massive networks; triangle counting; triangle listing

Scalability 3

81. Fast clustering using MapReduce.

【Paper Link】【Pages】:681-689

【Authors】: Alina Ene ; Sungjin Im ; Benjamin Moseley

【Abstract】: Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in MRC0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.

【Keywords】: MapReduce; approximation algorithms; clustering algorithms; k-center; k-median; parallel algorithms

82. Clustering very large multi-dimensional datasets with MapReduce.

【Paper Link】【Pages】:690-698

【Authors】: Robson Leonardo Ferreira Cordeiro ; Caetano Traina Jr. ; Agma Juci Machado Traina ; Julio López ; U. Kang ; Christos Faloutsos

【Abstract】: Given a very large moderate-to-high dimensionality dataset, how could one cluster its points? For datasets that don't fit even on a single disk, parallelism is a first class option. In this paper we explore MapReduce for clustering this kind of data. The main questions are (a) how to minimize the I/O cost, taking into account the already existing data partition (e.g., on disks), and (b) how to minimize the network cost among processing nodes. Either of them may be a bottleneck. Thus, we propose the Best of both Worlds -- BoW method, that automatically spots the bottleneck and chooses a good strategy. Our main contributions are: (1) We propose BoW and carefully derive its cost functions, which dynamically choose the best strategy; (2) We show that BoW has numerous desirable features: it can work with most serial clustering methods as a plugged-in clustering subroutine, it balances the cost for disk accesses and network accesses, achieving a very good tradeoff between the two, it uses no user-defined parameters (thanks to our reasonable defaults), it matches the clustering quality of the serial algorithm, and it has near-linear scale-up; and finally, (3) We report experiments on real and synthetic data with billions of points, using up to 1,024 cores in parallel. To the best of our knowledge, our Yahoo! web is the largest real dataset ever reported in the database subspace clustering literature. Spanning 0.2 TB of multi-dimensional data, it took only 8 minutes to be clustered, using 128 cores.

【Keywords】: clustering; mapreduce; very large multi-dimensional datasets

83. Selective block minimization for faster convergence of limited memory large-scale linear models.

【Paper Link】【Pages】:699-707

【Authors】: Kai-Wei Chang ; Dan Roth

【Abstract】: As the size of data sets used to build classifiers steadily increases, training a linear model efficiently with limited memory becomes essential. Several techniques deal with this problem by loading blocks of data from disk one at a time, but usually take a considerable number of iterations to converge to a reasonable model. Even the best block minimization techniques [1] require many block loads since they treat all training examples uniformly. As disk I/O is expensive, reducing the amount of disk access can dramatically decrease the training time. This paper introduces a selective block minimization (SBM) algorithm, a block minimization method that makes use of selective sampling. At each step, SBM updates the model using data consisting of two parts: (1) new data loaded from disk and (2) a set of informative samples already in memory from previous steps. We prove that, by updating the linear model in the dual form, the proposed method fully utilizes the data in memory and converges to a globally optimal solution on the entire data. Experiments show that the SBM algorithm dramatically reduces the number of blocks loaded from disk and consequently obtains an accurate and stable model quickly on both binary and multi-class classification.

【Keywords】: document classification; large-scale learning; linear classification

Predictive modeling 3

84. Bounded coordinate-descent for biological sequence classification in high dimensional predictor space.

【Paper Link】【Pages】:708-716

【Authors】: Georgiana Ifrim ; Carsten Wiuf

【Abstract】: We present a framework for discriminative sequence classification where linear classifiers work directly in the explicit high-dimensional predictor space of all subsequences in the training set (as opposed to kernel-induced spaces). This is made feasible by employing a gradient-bounded coordinate-descent algorithm for efficiently selecting discriminative subsequences without having to expand the whole space. Our framework can be applied to a wide range of loss functions, including binomial log-likelihood loss of logistic regression and squared hinge loss of support vector machines. When applied to protein remote homology detection and remote fold recognition, our framework achieves comparable performance to the state-of-the-art (e.g., kernel support vector machines). In contrast to state-of-the-art sequence classifiers, our models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem -- a crucial requirement for the bioinformatics and medical communities.

【Keywords】: greedy coordinate-descent; logistic regression; sequence classification; string classification; support vector machines

85. Multi-source domain adaptation and its application to early detection of fatigue.

【Paper Link】【Pages】:717-725

【Authors】: Rita Chattopadhyay ; Jieping Ye ; Sethuraman Panchanathan ; Wei Fan ; Ian Davidson

【Abstract】: We consider the characterization of muscle fatigue through noninvasive sensing mechanism such as surface electromyography (SEMG). While changes in the properties of SEMG signals with respect to muscle fatigue have been reported in the literature, the large variation in these signals across different individuals makes the task of modeling and classification of SEMG signals challenging. Indeed, the variation in SEMG parameters from subject to subject creates differences in the data distribution. In this paper, we propose a transfer learning framework based on the multi-source domain adaptation methodology for detecting different stages of fatigue using SEMG signals, that addresses the distribution differences. In the proposed framework, the SEMG data of a subject represent a domain; data from multiple subjects in the training set form the multiple source domains and the test subject data form the target domain. SEMG signals are predominantly different in conditional probability distribution across subjects. The key feature of the proposed framework is a novel weighting scheme that addresses the conditional probability distribution differences across multiple domains (subjects). We have validated the proposed framework on Surface Electromyogram signals collected from 8 people during a fatigue-causing repetitive gripping activity. Comprehensive experiments on the SEMG data set demonstrate that the proposed method improves the classification accuracy by 20% to 30% over the cases without any domain adaptation method and by 13% to 30% over the existing state-of-the-art domain adaptation methods.

【Keywords】: multi-source domain adaption; subject based variability; surface electromyogram; transfer learning

86. Two-locus association mapping in subquadratic time.

【Paper Link】【Pages】:726-734

【Authors】: Panagiotis Achlioptas ; Bernhard Schölkopf ; Karsten M. Borgwardt

【Abstract】: Genome-wide association studies (GWAS) have not been able to discover strong associations between many complex human diseases and single genetic loci. Mapping these phenotypes to pairs of genetic loci is hindered by the huge number of candidates leading to enormous computational and statistical problems. In GWAS on single nucleotide polymorphisms (SNPs), one has to consider in the order of 1010 to 1014 pairs, which is infeasible in practice. In this article, we give the first algorithm for 2-locus genome-wide association studies that is subquadratic in the number, n, of SNPs. The running time of our algorithm is data-dependent, but large experiments over real genomic data suggest that it scales empirically as n3/2. As a result, our algorithm can easily cope with n ~ 107, i.e., it can efficiently search all pairs of SNPs in the human genome.

【Keywords】: epistasis detection; lightbulb algorithm; two-locus association mapping

Demonstration track 11

87. A taxi business intelligence system.

【Paper Link】【Pages】:735-738

【Authors】: Yong Ge ; Chuanren Liu ; Hui Xiong ; Jian Chen

【Abstract】: The increasing availability of large-scale location traces creates unprecedent opportunities to change the paradigm for knowledge discovery in transportation systems. A particularly promising area is to extract useful business intelligence, which can be used as guidance for reducing inefficiencies in energy consumption of transportation sectors, improving customer experiences, and increasing business performances. However, extracting business intelligence from location traces is not a trivial task. Conventional data analytic tools are usually not customized for handling large, complex, dynamic, and distributed nature of location traces. To that end, we develop a taxi business intelligence system to explore the massive taxi location traces from different business perspectives with various data mining functions. Since we implement the system using the real-world taxi GPS data, this demonstration will help taxi companies to improve their business performances by understanding the behaviors of both drivers and customers. In addition, several identified technical challenges also motivate data mining people to develop more sophisticate techniques in the future.

【Keywords】: business intelligence; route recommendation; taxi driving fraud detection

88. Apolo: interactive large graph sensemaking by combining machine learning and visualization.

【Paper Link】【Pages】:739-742

【Authors】: Duen Horng Chau ; Aniket Kittur ; Jason I. Hong ; Christos Faloutsos

【Abstract】: We present APOLO, a system that uses a mixed-initiative approach to help people interactively explore and make sense of large network datasets. It combines visualization, rich user interaction and machine learning to engage the user in bottom-up sensemaking to gradually build up an understanding over time by starting small, rather than starting big and drilling down. APOLO helps users find relevant information by specifying exemplars, and then using a machine learning method called Belief Propagation to infer which other nodes may be of interest. We demonstrate APOLO's usage and benefits using a Google Scholar citation graph, consisting of 83,000 articles (nodes) and 150,000 citations relationships. A demo video of APOLO is available at http://www.cs.cmu.edu/~dchau/apolo/apolo.mp4.

【Keywords】: belief propagation; large network; sensemaking

89. Article clipper: a system for web article extraction.

【Paper Link】【Pages】:743-746

【Authors】: Jian Fan ; Ping Luo ; Suk Hwan Lim ; Sam Liu ; Joshi Parag ; Jerry Liu

【Abstract】: Many people use the Web as the main source of information in their daily lives. However, most web pages contain non-informative components such as side bars, footers, headers, and advertisements, which are undesirable for certain applications like printing. We demonstrate a system that automatically extracts the informative contents from news- and blog-like web pages. In contrast to many existing methods that are limited to identifying only the text or the bounding rectangular region, our system not only identifies the content but also the structural roles of various content components such as title, paragraphs, images and captions. The structural information enables re-layout of the content in a pleasing way. Besides the article text extraction, our system includes the following components: 1) print-link detection to identify the URL link for printing, and to use it for more reliable analysis and recognition; 2) title detection incorporating both visual cues and HTML tags; 3) image and caption detection utilizing extensive visual cues; 4) multiple-page and next page URL detection. The performance of our system has been thoroughly evaluated using a human labeled ground truth dataset consisting of 2000 web pages from 100 major web sites. We show accurate results using such a dataset.

【Keywords】: information extraction; page segmentation; web article extraction

90. Data intensive analysis on the gordon high performance data and compute system.

【Paper Link】【Pages】:747-748

【Authors】: Robert S. Sinkovits ; Pietro Cicotti ; Shawn Strande ; Mahidhar Tatineni ; Paul Rodriguez ; Nicole Wolter ; Natasha Balac

【Abstract】: The Gordon data intensive computing system was designed to handle problems with large memory requirements that cannot easily be solved using standard workstations or distributed memory supercomputers. We describe the unique features of Gordon that make it ideally suited for data mining and knowledge discovery applications: memory aggregation using the vSMP software solution from ScaleMP, I/O nodes containing 4 TB of low-latency flash memory, and a high performance parallel file system with 4 PB capacity. We also demonstrate how a number of standard data mining tools (e.g. Matlab, WEKA, R) can be used effectively on Dash, an early prototype of the full Gordon system.

【Keywords】: data intensive applications; data mining; flash memory; high-performance computing; machine learning; vsmp

91. Frontex real-time news event extraction framework.

【Paper Link】【Pages】:749-752

【Authors】: Jakub Piskorski ; Martin Atkinson

【Abstract】: An ever-growing amount of information relevant for early detection of certain threats can be extracted from on-line news. This led to an emergence of news mining tools to help analysts to digest the overflow of information and to extract valuable knowledge from on line news sources. This paper gives an overview of the fully operational Real-time News Event Extraction Framework developed for Frontex, the EU Border Agency, to facilitate the process of extracting structured information on border security-related events from on-line news. In particular, a hybrid event extraction system has been constructed, which is applied to the stream of news articles continuously gathered and pre-processed by the Europe Media Monitor - a large-scale multilingual news aggregation engine. The framework consists also of an earth browser, in which events are visualized and an event moderation tool, which allows to access the database of automatically extracted event descriptions and to clean, validate, group, enhance and export them into other knowledge repositories.

【Keywords】: event extraction; multilingual news mining; visualization

【Paper Link】【Pages】:753-756

【Authors】: Xin Jin ; Chi Wang ; Jiebo Luo ; Xiao Yu ; Jiawei Han

【Abstract】: Social media is becoming increasingly ubiquitous and popular on the Internet. Due to the huge popularity of social media websites, such as Facebook, Twitter, YouTube and Flickr, many companies or public figures are now active in maintaining pages on those websites to interact with online users, attracting a large number of fans/followers by posting interesting objects, e.g., (product) photos/videos and text messages. 'Like' has now become a very popular social function by allowing users to express their like of certain objects. It provides an accurate way of estimating user interests and an effective way of sharing/promoting information in social media. In this demo, we propose a system called LikeMiner to mine the power of 'like' in social media networks. We introduce a heterogeneous network model for social media with 'likes', and propose 'like' mining algorithms to estimate representativeness and influence of objects. The implemented prototype system demonstrates the effectiveness of the proposed approach using the large scale Facebook data.

【Keywords】: data mining; influence analysis; information network; like; ranking; recommendation; social media

93. MIME: a framework for interactive visual pattern mining.

【Paper Link】【Pages】:757-760

【Authors】: Bart Goethals ; Sandy Moens ; Jilles Vreeken

【Abstract】: We present a framework for interactive visual pattern mining. Our system enables the user to browse through the data and patterns easily and intuitively, using a toolbox consisting of interestingness measures, mining algorithms and post-processing algorithms to assist in identifying interesting patterns. By mining interactively, we enable the user to combine their subjective interestingness measure and background knowledge with a wide variety of objective measures to easily and quickly mine the most important and interesting patterns. Basically, we enable the user to become an essential part of the mining algorithm. Our demo currently applies to mining interesting itemsets and association rules, and its extension to episodes and decision trees is ongoing.

【Keywords】: interactive visual mining; mime; pattern exploration

94. SIGKDD demo: sensors and software to allow computational entomology, an emerging application of data mining.

【Paper Link】【Pages】:761-764

【Authors】: Gustavo E. A. P. A. Batista ; Eamonn J. Keogh ; Agenor Mafra-Neto ; Edgar Rowton

【Abstract】: The history of humankind is intimately connected to insects. Insect borne diseases kill a million people and destroy tens of billions of dollars worth of crops annually. However, at the same time, beneficial insects pollinate the majority of crop species, and it has been estimated that approximately one third of all food consumed by humans is directly pollinated by bees alone. Given the importance of insects in human affairs, it is somewhat surprising that computer science has not had a larger impact in entomology. We believe that recent advances in sensor technology are beginning change this, and a new field of Computational Entomology will emerge. We will demonstrate an inexpensive sensor that allows us to capture data from flying insects, and the software that allows us to analyze the data. Moreover, we will distribute both the sensors and software for free, to parties willing to take part in a crowdsourcing project on insect classification.

【Keywords】: agricultural; classification; insects; spatiotemporal data

【Paper Link】【Pages】:765-768

【Authors】: Cheng-Te Li ; Shou-De Lin

【Abstract】: This work combines the central ideas from two different areas, crowd simulation and social network analysis, to tackle some existing problems in both areas from a new angle. We present a novel spatio-temporal social crowd simulation framework, Social Flocks, to revisit three essential research problems, (a) generation of social networks, (b) community detection in social networks, (c) modeling collective social behaviors in crowd simulation. Our framework produces social networks that satisfy the properties of high clustering coefficient, low average path length, and power-law degree distribution. It can also be exploited as a novel dynamic model for community detection. Finally our framework can be used to produce real-life collective social behaviors over crowds, including community-guided flocking, leader following, and spatio-social information propagation. Social Flocks can serve as visualization of simulated crowds for domain experts to explore the dynamic effects of the spatial, temporal, and social factors on social networks. In addition, it provides an experimental platform of collective social behaviors for social gaming and movie animations. Social Flocks demo is at http://mslab.csie.ntu.edu.tw/socialflocks/ .

【Keywords】: collective social behaviors; community detection; crowd simulation; flocking; network generation; social networks

【Paper Link】【Pages】:769-772

【Authors】: Jie Tang ; Sen Wu ; Bo Gao ; Yang Wan

【Abstract】: We study the problem of topic-level social network search, which aims to find who are the most influential users in a network on a specific topic and how the influential users connect with each other. We employ a topic model to find topical aspects of each user and a retrieval method to identify influential users by combining the language model and the topic model. An influence maximization algorithm is then presented to find the sub network that closely connects the influential users. Two demonstration systems have been developed and are online available. Empirical analysis based on the user's viewing time and the number of clicks validates the proposed methodologies.

【Keywords】: influence analysis; social networks; social search; topic model

97. Video analytics solution for tracking customer locations in retail shopping malls.

【Paper Link】【Pages】:773-776

【Authors】: Harikrishna G. N. Rai ; Kishore Jonna ; P. Radha Krishna

【Abstract】: Due to increased adoption of digital inclusion in various businesses, location based services are gaining importance to provide value-added services for their customers. In this work, we present a computer vision based system for tracking customer locations by recognizing individual shopping carts inside shopping malls in order to facilitate location based services. We provide an efficient approach for cart recognition that consists of two stages: cart detection and then cart recognition. A binary pattern is placed between two pre-defined color markers and attached to each cart for recognition. The system takes live video feed as input from the cameras mounted on the aisles of the shopping mall and processes frames in real-time. In the cart detection stage, color segmentation, feature extraction and classification are used for detection of binary pattern along with color markers. In recognition stage, segmented binary strip is processed using spatial image processing techniques to decode the cart identification number.

【Keywords】: classification; color segmentation; haar-like features; shopping cart detection

Industrial practice expo track 10

98. "Which half Is wasted?": controlled experiments to measure online-advertising effectiveness.

【Paper Link】【Pages】:777

【Authors】: David H. Reiley

【Abstract】: The department-store retailer John Wanamaker famously stated, "Half the money I spend on advertising is wasted--I just don't know which half." Compared with the measurement of advertising effectiveness in traditional media, online advertisers and publishers have considerable data advantages, including individual-level data on advertising exposures, clicks, searches, and other online user behaviors. However, as I shall discuss in this talk, the science of advertising effectiveness requires more than just quantity of data - even more important is the quality of the data. In particular, in many cases, using various statistical techniques with observational data leads to incorrect measurements. To measure the true causal effects, we run controlled experiments that suppress advertising to a control group, much like the placebo in a drug trial. With experiments to determine the ground truth, we can show that in many circumstances, observational-data techniques rely on identifying assumptions that prove to be incorrect, and they produce estimates differing wildly from the truth. Despite increases in data availability, Wanamaker's complaint remains just as true for online advertising as it was for print advertising a century ago. In this talk, I will discuss recent advances in running randomized experiments online, measuring the impact of online display advertising on consumer behavior. Interesting results include the measurable effects of online advertising on offline transactions, the impact on viewers who do not click the ads, the surprisingly large effects of frequency of exposure, and the heterogeneity of advertising effectiveness across users in different demographic groups or geographic locations. I also show that sample sizes of a million or more customers may be necessary to get enough precision for statistical significance of economically important effects - so we have just reached the cusp of being able to measure effects precisely with present technology. (By comparison, previous controlled experiments using split-cable TV systems, with sample sizes in the mere thousands, have lacked statistical power to measure precise effects for a given campaign.) As I show with several examples that establish the ground truth using controlled experiments, the bias in observational studies can be extremely large, over-or-underestimating the true causal effects by an order of magnitude. I will discuss the (implicit or explicit) modeling assumptions made by researchers using observational data, and identify several reasons why these assumptions are violated in practice. I will also discuss future directions in using experiments to measure advertising effectiveness.

【Keywords】: advertising effectiveness; computational advertising; data mining; experimentation.

99. Accelerating large-scale data mining using in-database analytics.

【Paper Link】【Pages】:778

【Authors】: Mario E. Inchiosa

【Abstract】: In more and more industries, competitive advantage hinges on exploiting the largest quantity of data in the shortest possible time - and doing so cost-effectively. Data volumes are growing exponentially, while businesses are striving to deploy sophisticated and computationally intensive predictive analytics. Often, massive data is stored in a data warehouse running on dedicated parallel hardware, but advanced analytics is performed on a separate compute platform. Moving data from the data warehouse to the compute environment can constitute a significant bottleneck. Organizations resort to considering only a fraction of their data or refreshing their analyses infrequently. To address the data movement bottleneck and take full advantage of parallel data warehouse platforms, vendors are offering new in-database analytics capabilities. They are opening up their platforms, allowing users to run their own user-defined functions and statistical models as well as vendor- and partner-supplied advanced analytics on the database platform, close to the data, in parallel, without transporting the data through a host node or corporate network. In this talk, we will present the need for in-database analytics and discuss a number of the new solutions available, highlighting case studies where solution times have been reduced from hours to minutes or seconds.

【Keywords】: in-database analytics; large-scale data warehousing; predictive analytics

100. Applications of data mining and machine learning in online customer care.

【Paper Link】【Pages】:779

【Authors】: Ravi Vijayaraghavan ; P. V. Kannan

【Abstract】: With the coming of age of web as a mainstream customer service channel, B2C companies have invested substantial resources in enhancing their web presence. Today customers can interact with a company through channels such as phone, chat, email, social media or web self-service. With the availability of web logs, CRM data and text transcripts these online channels are rich with data and they track several aspects of customer behavior and intent. 24/7 Customer Innovation Labs has developed a series of data mining and statistics driven solutions to improve customer experience in each of these online channels. This talk will focus on solutions to enhance performance of web chat as a customer service channel. 2 stages of customer life-cycle will be considered -- new customer acquisition (or sales) and service of existing customers. In customer acquisition the key objective is to maximize "incremental" revenues via chat. While in customer service the objective is to drive up the quality of customer experience (measured by customer satisfaction surveys or mined customer sentiments) through chat. The solution based on machine learning methods involves: Real-time targeting of the right visitors to chat Predicting customer needs Routing customer to the right customer service agent Mining chat transcripts and Social Media Portals to identify key customer issues and customer sentiments Mining agents' responses for performance improvement Feeding back learning from 4 and 5 to 1 (better targeting) Real-life case studies will be presented to show how that this closed loop solution can quickly improve key metrics.

【Keywords】: crm; customer acquisition; data mining; online customer support.; predictive systems

101. Broad scale predictive modeling and marketing optimization in retail sales.

【Paper Link】【Pages】:780

【Authors】: Dan Steinberg ; Felipe Fernandez Martinez

【Abstract】: The challenge of predicting retail sales on a product-by-product basis throughout a network of retail stores has been researched intensively by applied econometricians and statisticians for decades. The principal tools of analysis have been linear regression with Bayesian inspired adjustments to stabilize demand curve estimates. The scale of such analytics can be challenging as retailers often work with more than 100,000 products (SKUs) and typically operate networks of hundreds of brick and mortar stores. Department and grocery stores are excellent examples but fast food restaurants also require such detailed predictive modeling systems. Depending on the objectives of the company, predictions may be required for blocks of time spanning a week or more, or, as in the case of fast food operators, predictions are required for each 15-minute time interval of the operating day. The authors have modernized industry standard approaches to such predictive modeling by leveraging advanced data mining techniques. These techniques are more adept in detecting nonlinear response and accommodating interactions and automatically sifting through hundreds if not thousands of potential factors influencing sales outcomes. Results show that conventional statistical models miss a substantial fraction of the explainable variance while the new methods dominate in terms of performance and speed of model development. Accurate prediction is required for reliable planning and logistics, and optimization. Optimization with respect to pricing, promotion and assortment can be asked for relative to a variety of objectives (e.g. revenue, profits) and short term and long-term optimization may result in different decisions being taken. A unique challenge for retailers is the large number of constraints to which complex retail organizations are subject. Contracts and special understandings with valued suppliers severely constrain a retailer's flexibility. For example, certain products may not be promotable (or discounted) in isolation, and others (say from competitors) may not be promoted jointly, and the costs of goods sold may well depend on the quantities contracted. We discuss how we have resolved such challenges via a cycle of prediction and simulation to develop a flexible high-speed system for handling arbitrary constraints, arbitrary objectives, and achieve new levels of predictive accuracy and reliability.

【Keywords】: inventory management; large-scale data mining; predictive analytics; smart pricing.; supply-chain optimization

102. Knowledge discovery and data mining in pharmaceutical cancer research.

【Paper Link】【Pages】:781

【Authors】: Paul A. Rejto

【Abstract】: Biased and unbiased approaches to develop predictive biomarkers of response to drug treatment will be introduced and their utility demonstrated for cell cycle inhibitors. Opportunities to leverage the growing knowledge of tumors characterized by modern methods to measure DNA and RNA will be shown, including the use of appropriate preclinical models and selection of patients. Furthermore, techniques to identify mechanisms of resistance prior to clinical treatment will be discussed. Prospects for systematic data mining and current barriers to the application of precision medicine in cancer will be reviewed along with potential solutions.

【Keywords】: bio-informatics; computational biology; gene sequencing; micro-arrays; oncology

103. Operational security analytics: doing more with less.

【Paper Link】【Pages】:782

【Authors】: Colleen McCue

【Abstract】: Why just count crime when you can anticipate, prevent and respond more effectively? Companies in the commercial sector have long understood the importance of being able to anticipate or predict future behavior and demand in order to respond efficiently and effectively. Embracing the promise of predictive analytics, the public safety community is moving from a focus on "what happened," to a system that enables the ability to anticipate future events and effectively deploy resources in front of crime; thereby, changing outcomes. While we have become familiar with the use of advanced analytics in support of fraud detection and prevention, techniques similar to those used to support customer loyalty programs and supply chain management have been used to prevent and solve violent crimes, enhance investigative pace and efficacy, support information-based risk and threat assessment, and deploy public safety resources more efficiently. As public safety agencies increasingly are asked to do more with less, the ability to anticipate crime represents a game changing paradigm shift; enabling information-based tactics, strategy and policy in support of prevention and response. Reporting, collecting and compiling data are necessary but not sufficient to increasing public safety. Ultimately, the ability to anticipate, prevent and respond more effectively will enable us to do more with less and change public safety outcomes.

【Keywords】: algorithms; experimentation; human factors; management; measurement; security

104. Real-time risk control system for CNP (card not present).

【Paper Link】【Pages】:783

【Authors】: Tai Hsu

【Abstract】: AliExpress is an online e-commerce platform for wholesale products. Credit card is one of its various payment methods. An online transaction using credit cards is called a "card not present" (CNP) transaction where the physical card has not been swiped into a reader. It's also the major type of credit card frauds causing a great overhead of the online operation, sellers, and buyers. To protect customers on our platform, we developed a real-time credit card fraud detection system, using the machine learning technologies which allows us to achieve a precision of 97%, at a recall of 80%. With the system, we can provide the best online shopping experience for our customers, without the high risk of online transactions which always result a high operational cost. We will briefly share our experience and practice in the expo.

【Keywords】: data mining in e-commerce; financial risk modeling.; fraud detection

105. The power of analysis and data.

【Paper Link】【Pages】:784

【Authors】: David Norton

【Abstract】: Caesars Entertainment, the largest provider of branded casino entertainment, captures a wealth of data for 40 million+ customers through its Total Rewards program. In-depth data analysis has helped Caesars weather the economic downturn by prioritizing marketing spend, expense savings targets and identifying new revenue opportunities. This talk will describe how closed-loop marketing, state-of-the-art user segmentation, and ongoing experimentation via test and control groups have enabled Caesars Entertainment to achieve all-time high customer satisfaction scores and outperform the competition in a challenging economic climate. The lessons learned are generic and apply across multiple industries. Insights will also be provided on the next wave of challenges to be answered analytically.

【Keywords】: large-scale data mining; market segmentation; predictive analytics; revenue optimization

106. The practitioner's viewpoint to data mining: key lessons learned in the trenches and case studies.

【Paper Link】【Pages】:785

【Authors】: Richard Boire

【Abstract】: In many data mining exercises, we see information that appears on the surface to demonstrate a particular conclusion. But closer examination of the data reveals that these results are indeed misleading. In this session, we will examine this notion of misleading results in three areas: Statistical Issues Statistical issues such as multicollinearity and outliers can impact results dramatically. We will first outline how these statistical issues can provide misleading results. At the same time, we will demonstrate how the data mining practitioner overcomes these issues through data analysis approaches that provide both more meaningful and non-misleading results to the business community. Overstating of Results From a business standpoint, we will also look at results that appear to be too good to be true. In other words, there appears to be some overstating of results within a given data mining solution. Initially, we will discuss how to identify these situations. Secondly, we will outline what causes this overstatement of results and detail our approach on how we would overcome this predicament. Overfitting Another topic for discussion is overfitting of results. This is particularly the case when building predictive models. In this section of the seminar, we will define what overfitting is and why it is becoming more relevant for understanding by the business community. Once again, analytical approaches will be discussed in terms of how to best handle this issue. We present two case studies that demonstrate how our principled 4-step approach can be used to solve challenging data mining problems. These 4 steps are as follows: How to identify the problem How we construct the right data environment to conduct our analytics What kind of analytics are employed which include techniques such as correlation analysis, EDA reports, logistic regression, and gains charts. More importantly, we discuss how to interpret the output in terms of the actual impact to the business (i.e. increased response rate and ultimately increased ROI.) How do we apply the learning to a future initiative and what were the actual results

【Keywords】: cross-sell and up-sell marketing; data mining; financial services; predictive analytics; roi optimization; segmentation.

107. Thriving as a data miner in the real world.

【Paper Link】【Pages】:786

【Authors】: John F. Elder IV

【Abstract】: Meaningful work is a deep human need. We all yearn to contribute to something greater than ourselves, be listened to, and work alongside friendly peers. Data mining consulting is a powerful way to use technical skills and gain these great side benefits. The power of analytics and its high return on investment makes one's expertise welcome virtually everywhere. And the variety of projects and domains encountered leads to continual learning as new problems are met and solved. Teaching and writing are possible, and there is great satisfaction in seeing one's work actually implemented and used, potentially touching millions. Still, in industry, one has the joy and hazards of working closely with other humans, where final success can depend as much on others as oneself, and on social as well as technical issues. In my experience, business risk strongly outweighs technical risk in whether a solution is used. I will share some hard-won lessons learned on how to best succeed, both technically and socially, in the results-oriented world of industry.

【Keywords】: bio-informatics; crm; data mining; e-commerce; predictive systems

Poster session 70

108. 2D-interval predictions for time series.

【Paper Link】【Pages】:787-794

【Authors】: Luís Torgo ; Orlando Ohashi

【Abstract】: Research on time series forecasting is mostly focused on point predictions - models are obtained to estimate the expected value of the target variable for a certain point in future. However, for several relevant applications this type of forecasts has limited utility (e.g. costumer wallet value estimation, wind and electricity power production, control of water quality, etc.). For these domains it is frequently more important to be able to forecast a range of plausible future values of the target variable. A typical example is wind power production, where it is of high relevance to predict the future wind variability in order to ensure that supply and demand are balanced. This type of predictions will allow timely actions to be taken in order to cope with the expected values of the target variable on a certain future time horizon. In this paper we study this type of predictions - the prediction of a range of expected values for a future time interval. We describe some possible approaches to this task and propose an alternative procedure that our extensive experiments on both artificial and real world domains show to have clear advantages.

【Keywords】: forecasting; prediction; time series

109. A game theoretic framework for heterogenous information network clustering.

【Paper Link】【Pages】:795-804

【Authors】: Faris Alqadah ; Raj Bhatnagar

【Abstract】: Heterogeneous information networks are pervasive in applications ranging from bioinformatics to e-commerce. As a result, unsupervised learning and clustering methods pertaining to such networks have gained significant attention recently. Nodes in a heterogeneous information network are regarded as objects derived from distinct domains such as 'authors' and 'papers'. In many cases, feature sets characterizing the objects are not available, hence, clustering of the objects depends solely on the links and relationships amongst objects. Although several previous studies have addressed information network clustering, shortcomings remain. First, the definition of what constitutes an information network cluster varies drastically from study to study. Second, previous algorithms have generally focused on non-overlapping clusters, while many algorithms are also limited to specific network topologies. In this paper we introduce a game theoretic framework (GHIN) for defining and mining clusters in heterogeneous information networks. The clustering problem is modeled as a game wherein each domain represents a player and clusters are defined as the Nash equilibrium points of the game. Adopting the abstraction of Nash equilibrium points as clusters allows for flexible definition of reward functions that characterize clusters without any modification to the underlying algorithm. We prove that well-established definitions of clusters in 2-domain information networks such as formal concepts, maximal bi-cliques, and noisy binary tiles can always be represented as Nash equilibrium points. Moreover, experimental results employing a variety of reward functions and several real world information networks illustrate that the GHIN framework produces more accurate and informative clusters than the recently proposed NetClus and state of the art MDC algorithms.

【Keywords】: clustering; game theoretic clustering; information networks; multi-way clustering

110. A GPU-tailored approach for training kernelized SVMs.

【Paper Link】【Pages】:805-813

【Authors】: Andrew Cotter ; Nathan Srebro ; Joseph Keshet

【Abstract】: We present a method for efficiently training binary and multiclass kernelized SVMs on a Graphics Processing Unit (GPU). Our methods apply to a broad range of kernels, including the popular Gaus- sian kernel, on datasets as large as the amount of available memory on the graphics card. Our approach is distinguished from earlier work in that it cleanly and efficiently handles sparse datasets through the use of a novel clustering technique. Our optimization algorithm is also specifically designed to take advantage of the graphics hardware. This leads to different algorithmic choices then those preferred in serial implementations. Our easy-to-use library is orders of magnitude faster then existing CPU libraries, and several times faster than prior GPU approaches.

【Keywords】: GPGPU

111. A multi-task learning formulation for predicting disease progression.

【Paper Link】【Pages】:814-822

【Authors】: Jiayu Zhou ; Lei Yuan ; Jun Liu ; Jieping Ye

【Abstract】: Alzheimer's Disease (AD), the most common type of dementia, is a severe neurodegenerative disorder. Identifying markers that can track the progress of the disease has recently received increasing attentions in AD research. A definitive diagnosis of AD requires autopsy confirmation, thus many clinical/cognitive measures including Mini Mental State Examination (MMSE) and Alzheimer's Disease Assessment Scale cognitive subscale (ADAS-Cog) have been designed to evaluate the cognitive status of the patients and used as important criteria for clinical diagnosis of probable AD. In this paper, we propose a multi-task learning formulation for predicting the disease progression measured by the cognitive scores and selecting markers predictive of the progression. Specifically, we formulate the prediction problem as a multi-task regression problem by considering the prediction at each time point as a task. We capture the intrinsic relatedness among different tasks by a temporal group Lasso regularizer. The regularizer consists of two components including an L2,1-norm penalty on the regression weight vectors, which ensures that a small subset of features will be selected for the regression models at all time points, and a temporal smoothness term which ensures a small deviation between two regression models at successive time points. We have performed extensive evaluations using various types of data at the baseline from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database for predicting the future MMSE and ADAS-Cog scores. Our experimental studies demonstrate the effectiveness of the proposed algorithm for capturing the progression trend and the cross-sectional group differences of AD severity. Results also show that most markers selected by the proposed algorithm are consistent with findings from existing cross-sectional studies.

【Keywords】: Alzheimer's disease; cognitive score; group lasso; multi-task learning; regression; stability selection

112. A simple statistical model and association rule filtering for classification.

【Paper Link】【Pages】:823-831

【Authors】: György J. Simon ; Vipin Kumar ; Peter W. Li

【Abstract】: Associative classification is a predictive modeling technique that constructs a classifier based on class association rules (also known as predictive association rules; PARs). PARs are association rules where the consequence of the rule is a class label. Associative classification has gained substantial research attention because it successfully joins the benefits of association rule mining with classification. These benefits include the inherent ability of association rule mining to extract high-order interactions among the predictors--an ability that many modern classifiers lack--and also the natural interpretability of the individual PARs. Associative classification is not without its caveats. Association rule mining often discovers a combinatorially large number of association rules, eroding the interpretability of the rule set. Extensive effort has been directed towards developing interestingness measures, which filter (predictive) association rules after they have been generated. These interestingness measures, albeit very successful at selecting interesting rules, lack two features that are highly valuable in the context of classification. First, only few of the interestingness measures are rooted in a statistical model. Given the distinction between a training and a test data set in the classification setting, the ability to make statistical inferences about the performance of the predictive classification rules on the test set is highly desirable. Second, the unfiltered set of predictive assocation rules (PARs) are often redundant, we can prove that certain PARs will not be used to construct a classification model given the presence of other PARs. In this paper, we propose a simple statistical model towards making inferences on the test set about the various performance metrics of predictive association rules. We also derive three filtering criteria based on hypothesis testing, which are very selective (reduce the number of PARs to be considered by the classifier by several orders of magnitude), yet do not effect the performance of the classification adversely. In the case, where the classification model is constructed as a logistic model on top of the PARs, we can mathematically prove, that the filtering criteria do not significantly effect the classifier's performance. We also demonstrate empirically on three publicly available data sets that the vast reduction in the number of PARs indeed did not come at the cost of reducing the predictive performance.

【Keywords】: association rule mining; classification; filtering; statistical significance

113. A time-dependent topic model for multiple text streams.

【Paper Link】【Pages】:832-840

【Authors】: Liangjie Hong ; Byron Dom ; Siva Gurumurthy ; Kostas Tsioutsiouliklis

【Abstract】: In recent years social media have become indispensable tools for information dissemination, operating in tandem with traditional media outlets such as newspapers, and it has become critical to understand the interaction between the new and old sources of news. Although social media as well as traditional media have attracted attention from several research communities, most of the prior work has been limited to a single medium. In addition temporal analysis of these sources can provide an understanding of how information spreads and evolves. Modeling temporal dynamics while considering multiple sources is a challenging research problem. In this paper we address the problem of modeling text streams from two news sources - Twitter and Yahoo! News. Our analysis addresses both their individual properties (including temporal dynamics) and their inter-relationships. This work extends standard topic models by allowing each text stream to have both local topics and shared topics. For temporal modeling we associate each topic with a time-dependent function that characterizes its popularity over time. By integrating the two models, we effectively model the temporal dynamics of multiple correlated text streams in a unified framework. We evaluate our model on a large-scale dataset, consisting of text streams from both Twitter and news feeds from Yahoo! News. Besides overcoming the limitations of existing models, we show that our work achieves better perplexity on unseen data and identifies more coherent topics. We also provide analysis of finding real-world events from the topics obtained by our model.

【Keywords】: news; temporal dynamics; text streams; topic models; twitter

114. Active learning for node classification in assortative and disassortative networks.

【Paper Link】【Pages】:841-849

【Authors】: Cristopher Moore ; Xiaoran Yan ; Yaojia Zhu ; Jean-Baptiste Rouquier ; Terran Lane

【Abstract】: In many real-world networks, nodes have class labels or variables that affect the network's topology. If the topology of the network is known but the labels of the nodes are hidden, we would like to select a small subset of nodes such that, if we knew their labels, we could accurately predict the labels of all the other nodes. We develop an active learning algorithm for this problem which uses information-theoretic techniques to choose which nodes to explore. We test our algorithm on networks from three different domains: a social network, a network of English words that appear adjacently in a novel, and a marine food web. Our algorithm makes no initial assumptions about how the groups connect, and performs well even when faced with quite general types of network structure. In particular, we do not assume that nodes of the same class are more likely to be connected to each other - only that they connect to the rest of the network in similar ways.

【Keywords】: active learning; collective classification; community detection; complex networks; information theory; structure and function; transductive graph labeling

115. Active learning using on-line algorithms.

【Paper Link】【Pages】:850-858

【Authors】: Chris Mesterharm ; Michael J. Pazzani

【Abstract】: This paper describes a new technique and analysis for using on-line learning algorithms to solve active learning problems. Our algorithm is called Active Vote, and it works by actively selecting instances that force several perturbed copies of an on-line algorithm to make mistakes. The main intuition for our result is based on the fact that the number of mistakes made by the optimal on-line algorithm is a lower bound on the number of labels needed for active learning. We provide performance bounds for Active Vote in both a batch and on-line model of active learning. These performance bounds depend on the algorithm having a set of unlabeled instances in which the various perturbed on-line algorithms disagree. The motivating application for Active Vote is an Internet advertisement rating program. We conduct experiments using data collected for this advertisement problem along with experiments using standard datasets. We show Active Vote can achieve an order of magnitude decrease in the number of labeled instances over various passive learning algorithms such as Support Vector Machines.

【Keywords】: active learning; internet advertisement; online learning

116. Algorithms for speeding up distance-based outlier detection.

【Paper Link】【Pages】:859-867

【Authors】: Kanishka Bhaduri ; Bryan L. Matthews ; Chris Giannella

【Abstract】: The problem of distance-based outlier detection is difficult to solve efficiently in very large datasets because of potential quadratic time complexity. We address this problem and develop sequential and distributed algorithms that are significantly more efficient than state-of-the-art methods while still guaranteeing the same outliers. By combining simple but effective indexing and disk block accessing techniques, we have developed a sequential algorithm iOrca that is up to an order-of-magnitude faster than the state-of-the-art. The indexing scheme is based on sorting the data points in order of increasing distance from a fixed reference point and then accessing those points based on this sorted order. To speed up the basic outlier detection technique, we develop two distributed algorithms (DOoR and iDOoR) for modern distributed multi-core clusters of machines, connected on a ring topology. The first algorithm passes data blocks from each machine around the ring, incrementally updating the nearest neighbors of the points passed. By maintaining a cutoff threshold, it is able to prune a large number of points in a distributed fashion. The second distributed algorithm extends this basic idea with the indexing scheme discussed earlier. In our experiments, both distributed algorithms exhibit significant improvements compared to the state-of-the-art distributed method [13].

【Keywords】: distributed computing; nearest neighbor; outlier detection

117. An effective evaluation measure for clustering on evolving data streams.

【Paper Link】【Pages】:868-876

【Authors】: Hardy Kremer ; Philipp Kranen ; Timm Jansen ; Thomas Seidl ; Albert Bifet ; Geoff Holmes ; Bernhard Pfahringer

【Abstract】: Due to the ever growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and ground-truth-based measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution. In this paper we develop a novel evaluation measure for stream clustering called Cluster Mapping Measure (CMM). CMM effectively indicates different types of errors by taking the important properties of evolving data streams into account. We show in extensive experiments on real and synthetic data that CMM is a robust measure for stream clustering evaluation.

【Keywords】: evaluation measure; stream clustering

118. An iterated graph laplacian approach for ranking on manifolds.

【Paper Link】【Pages】:877-885

【Authors】: Xueyuan Zhou ; Mikhail Belkin ; Nathan Srebro

【Abstract】: Ranking is one of the key problems in information retrieval. Recently, there has been significant interest in a class of ranking algorithms based on the assumption that data is sampled from a low dimensional manifold embedded in a higher dimensional Euclidean space. In this paper, we study a popular graph Laplacian based ranking algorithm [23] using an analytical method, which provides theoretical insights into the ranking algorithm going beyond the intuitive idea of "diffusion." Our analysis shows that the algorithm is sensitive to a commonly used parameter due to the use of symmetric normalized graph Laplacian. We also show that the ranking function may diverge to infinity at the query point in the limit of infinite samples. To address these issues, we propose an improved ranking algorithm on manifolds using Green's function of an iterated unnormalized graph Laplacian, which is more robust and density adaptive, as well as pointwise continuous in the limit of infinite samples. We also for the first time in the ranking literature empirically explore two variants from a family of twice normalized graph Laplacians. Experimental results on text and image data support our analysis, which also suggest the potential value of twice normalized graph Laplacians in practice.

【Keywords】: green's function; iterated graph laplacian

119. Anomaly localization for network data streams with graph joint sparse PCA.

【Paper Link】【Pages】:886-894

【Authors】: Ruoyi Jiang ; Hongliang Fei ; Jun Huan

【Abstract】: Determining anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest. Principal Component Analysis (PCA) has been extensively applied to detecting anomalies in network data streams. However, none of existing PCA based approaches addresses the problem of identifying the sources that contribute most to the observed anomaly, or anomaly localization. In this paper, we propose novel sparse PCA methods to perform anomaly detection and localization for network data streams. Our key observation is that we can localize anomalies by identifying a sparse low dimensional space that captures the abnormal events in data streams. To better capture the sources of anomalies, we incorporate the structure information of the network stream data in our anomaly localization framework. We have performed comprehensive experimental studies of the proposed methods, and have compared our methods with the state-ofthe-art using three real-world data sets from different application domains. Our experimental studies demonstrate the utility of the proposed methods.

【Keywords】: PCA; anomaly localization; network data streams; sparse learning

120. Approximate kernel k-means: solution to large scale kernel clustering.

【Paper Link】【Pages】:895-903

【Authors】: Radha Chitta ; Rong Jin ; Timothy C. Havens ; Anil K. Jain

【Abstract】: Digital data explosion mandates the development of scalable tools to organize the data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of data and hence do not perform well on real world data sets. While kernel-based clustering algorithms can capture the non-linear structure in data, they do not scale well in terms of speed and memory requirements when the number of objects to be clustered exceeds tens of thousands. We propose an approximation scheme for kernel k-means, termed approximate kernel k-means, that reduces both the computational complexity and the memory requirements by employing a randomized approach. We show both analytically and empirically that the performance of approximate kernel k-means is similar to that of the kernel k-means algorithm, but with dramatically reduced run-time complexity and memory requirements.

【Keywords】: k-means; kernel clustering; large-scale clustering

121. Ask me better questions: active learning queries based on rule induction.

【Paper Link】【Pages】:904-912

【Authors】: Parisa Rashidi ; Diane J. Cook

【Abstract】: Active learning methods are used to improve the classification accuracy when little labeled data is available. Most traditional active learning methods pose a very specific query to the oracle, i.e. they ask for the label of an unlabeled example. This paper proposes a novel active learning method called RIQY (Rule Induced active learning QuerY). It can construct generic active learning queries based on rule induction from multiple unlabeled instances. These queries are shorter and more readable for the oracle and encompass many similar cases. Also the learning algorithm can achieve higher accuracy rates by asking fewer queries. We evaluate our algorithm on 12 different real datasets. Our results show that we can achieve higher accuracy rates using fewer queries compared to the traditional active learning methods.

【Keywords】: active learning; machine learning

122. Automatically tagging email by leveraging other users' folders.

【Paper Link】【Pages】:913-921

【Authors】: Yehuda Koren ; Edo Liberty ; Yoelle Maarek ; Roman Sandler

【Abstract】: Most email applications devote a significant part of their real estate to organization mechanisms such as folders. Yet, we verified on the Yahoo! Mail service that 70% of email users have never defined a single folder. This implies that one of the most well known email features is underexploited. We propose here to revive the feature by providing a method for generating a lighter form of folders, or tags, benefiting even the most passive users. The method automatically associates, whenever possible, an appropriate semantic tag with a given email. This gives rise to an alternate mechanism for organizing and searching email. We advocate a novel modeling approach that exploits the overall population of users, thereby learning from the wisdom-of-crowds how to categorize messages. Given our massive user base, it is enough to learn from a minority of the users who label certain messages in order to label that kind of messages for the general population. We design a novel cascade classification approach, which copes with the severe scalability and accuracy constraints we are facing. Significant efficiency gains are achieved by working within a low dimensional latent space, and by using a novel hierarchical classifier. Precision level is controlled by separating the task into a two-phase classification process. We performed an extensive empirical study covering three different time periods, over 100 million messages, and thousands of candidate tags per message. The results are encouraging and compare favorably with alternative approaches. Our method successfully tags 72% of incoming email traffic. Performance-wise, the computational overhead, even on surge large traffic, is sufficiently low for our approach to be applicable in production on any large Web mail service.

【Keywords】: automatic foldering; email classification; hierarchical classifier; multi-label classification; multiclass classification

123. Axiomatic ranking of network role similarity.

【Paper Link】【Pages】:922-930

【Authors】: Ruoming Jin ; Victor E. Lee ; Hui Hong

【Abstract】: A key task in analyzing social networks and other complex networks is role analysis: describing and categorizing nodes by how they interact with other nodes. Two nodes have the same role if they interact with equivalent sets of neighbors. The most fundamental role equivalence is automorphic equivalence. Unfortunately, the fastest algorithm known for graph automorphism is nonpolynomial. Moreover, since exact equivalence is rare, a more meaningful task is measuring the role similarity between any two nodes. This task is closely related to the link-based similarity problem that SimRank addresses. However, SimRank and other existing simliarity measures are not sufficient because they do not guarantee to recognize automorphically or structurally equivalent nodes. This paper makes two contributions. First, we present and justify several axiomatic properties necessary for a role similarity measure or metric. Second, we present RoleSim, a role similarity metric which satisfies these axioms and which can be computed with a simple iterative algorithm. We rigorously prove that RoleSim satisfies all the axiomatic properties and demonstrate its superior interpretative power on both synthetic and real datasets.

【Keywords】: automorphic equivalence; complex network; ranking; role similarity; social network; vertex similarity

124. Brain effective connectivity modeling for alzheimer's disease by sparse gaussian bayesian network.

【Paper Link】【Pages】:931-939

【Authors】: Shuai Huang ; Jing Li ; Jieping Ye ; Adam Fleisher ; Kewei Chen ; Teresa Wu ; Eric Reiman

【Abstract】: Recent studies have shown that Alzheimer's disease (AD) is related to alteration in brain connectivity networks. One type of connectivity, called effective connectivity, defined as the directional relationship between brain regions, is essential to brain function. However, there have been few studies on modeling the effective connectivity of AD and characterizing its difference from normal controls (NC). In this paper, we investigate the sparse Bayesian Network (BN) for effective connectivity modeling. Specifically, we propose a novel formulation for the structure learning of BNs, which involves one L1-norm penalty term to impose sparsity and another penalty to ensure the learned BN to be a directed acyclic graph - a required property of BNs. We show, through both theoretical analysis and extensive experiments on eleven moderate and large benchmark networks with various sample sizes, that the proposed method has much improved learning accuracy and scalability compared with ten competing algorithms. We apply the proposed method to FDG-PET images of 42 AD and 67 NC subjects, and identify the effective connectivity models for AD and NC, respectively. Our study reveals that the effective connectivity of AD is different from that of NC in many ways, including the global-scale effective connectivity, intra-lobe, inter-lobe, and inter-hemispheric effective connectivity distributions, as well as the effective connectivity associated with specific brain regions. These findings are consistent with known pathology and clinical progression of AD, and will contribute to AD knowledge discovery.

【Keywords】: alzheimer's disease; bayesian network; brain network; fdg-pet; neuroimaging; sparse learning

125. Classification of functional magnetic resonance imaging data using informative pattern features.

【Paper Link】【Pages】:940-946

【Authors】: Francisco Pereira ; Matthew Botvinick

【Abstract】: The canonical technique for analyzing functional magnetic resonance imaging (fMRI) data, statistical parametric mapping, produces maps of brain locations that are more active during performance of a task than during a control condition. In recent years, there has been increasing awareness of the fact that there is information in the entire pattern of brain activation and not just in saliently active locations. Classifiers have been the tool of choice for capturing this information and used to make predictions ranging from what kind of object a subject is thinking about to what decision they will make. Such classifiers are usually trained on a selection of voxels from the 3D grid that makes up the activation pattern; often this means the best accuracy is obtained using few voxels, from all across the brain, and that different voxels will be chosen in different cross-validation folds, making the classifiers hard to interpret. The increasing commonality of datasets with tens to hundreds of classes makes this problem even more acute. In this paper we introduce a method for identifying informative subsets of adjacent voxels, corresponding to brain patches that distinguish subsets of classes. These patches can then be used to train classifiers for the distinctions they support and used as "pattern features" for a meta-classifier. We show that this method permits classification at a higher accuracy than that obtained with traditional voxel selection, and that the sets of voxels used are more reproducible across cross-validation folds than those identified with voxel selection, and lie in plausible brain locations.

【Keywords】: classification; clustering; feature synthesis; functional MRI; neuroscience

126. Clustering with relative constraints.

【Paper Link】【Pages】:947-955

【Authors】: Eric Yi Liu ; Zhaojun Zhang ; Wei Wang

【Abstract】: Recent studies have suggested using relative distance comparisons as constraints to represent domain knowledge. A natural extension to relative comparisons is the combination of two comparisons defined on the same set of three instances. Constraints in this form, termed Relative Constraints, provide a unified knowledge representation for both partitional and hierarchical clusterings. But many key properties of relative constraints remain unknown. In this paper, we answer the following important questions that enable the broader application of relative constraints in general clustering problems: " Feasibility: Does there exist a clustering that satisfies a given set of relative constraints? (consistency of constraints) "Completeness: Given a set of consistent relative constraints, how can one derive a complete clustering without running into dead-ends? " Informativeness: How can one extract the most informative relative constraints from given knowledge sources? We show that any hierarchical domain knowledge can be easily represented by relative constraints. We further present a hierarchical algorithm that finds a clustering satisfying all given constraints in polynomial time. Experiments showed that our algorithm achieves significantly higher accuracy than the existing metric learning approach based on relative comparisons.

【Keywords】: constrained clustering; hierarchical clustering; relative constraints

127. Common component analysis for multiple covariance matrices.

【Paper Link】【Pages】:956-964

【Authors】: Huahua Wang ; Arindam Banerjee ; Daniel Boley

【Abstract】: We consider the problem of finding a suitable common low dimensional subspace for accurately representing a given set of covariance matrices. With one covariance matrix, this is principal component analysis (PCA). For multiple covariance matrices, we term the problem Common Component Analysis (CCA). While CCA can be posed as a tensor decomposition problem, standard approaches to tensor decompositions have two critical issues: (i) tensor decomposition methods are iterative and rely on the initialization; (ii) for a given level of approximation error, it is difficult to choose a suitable low dimensionality. In this paper, we present a detailed analysis of CCA that yields an effective initialization and iterative algorithms for the problem. The proposed methodology has provable approximation guarantees w.r.t. the global maximum and also allows one to choose the dimensionality for a given level of approximation error. We also establish conditions under which the methodology will achieve the global maximum. We illustrate the effectiveness of the proposed method through extensive experiments on synthetic data as well as on two real stock market datasets, where major financial events can be visualized in low dimensions.

【Keywords】: PCA; dimensionality reduction; parafac; tensor decompositions; tucker

128. Compression of weighted graphs.

【Paper Link】【Pages】:965-973

【Authors】: Hannu Toivonen ; Fang Zhou ; Aleksi Hartikainen ; Atte Hinkka

【Abstract】: We propose to compress weighted graphs (networks), motivated by the observation that large networks of social, biological, or other relations can be complex to handle and visualize. In the process also known as graph simplification, nodes and (unweighted) edges are grouped to supernodes and superedges, respectively, to obtain a smaller graph. We propose models and algorithms for weighted graphs. The interpretation (i.e. decompression) of a compressed, weighted graph is that a pair of original nodes is connected by an edge if their supernodes are connected by one, and that the weight of an edge is approximated to be the weight of the superedge. The compression problem now consists of choosing supernodes, superedges, and superedge weights so that the approximation error is minimized while the amount of compression is maximized. In this paper, we formulate this task as the 'simple weighted graph compression problem'. We then propose a much wider class of tasks under the name of 'generalized weighted graph compression problem'. The generalized task extends the optimization to preserve longer-range connectivities between nodes, not just individual edge weights. We study the properties of these problems and propose a range of algorithms to solve them, with different balances between complexity and quality of the result. We evaluate the problems and algorithms experimentally on real networks. The results indicate that weighted graphs can be compressed efficiently with relatively little compression error.

【Keywords】: compression; graph mining; network; weighted graph

129. Content-driven trust propagation framework.

【Paper Link】【Pages】:974-982

【Authors】: V. G. Vinod Vydiswaran ; ChengXiang Zhai ; Dan Roth

【Abstract】: Existing fact-finding models assume availability of structured data or accurate information extraction. However, as online data gets more unstructured, these assumptions are no longer valid. To overcome this, we propose a novel, content-based, trust propagation framework that relies on signals from the textual content to ascertain veracity of free-text claims and compute trustworthiness of their sources. We incorporate the quality of relevant content into the framework and present an iterative algorithm for propagation of trust scores. We show that existing fact finders on structured data can be modeled as specific instances of this framework. Using a retrieval-based approach to find relevant articles, we instantiate the framework to compute trustworthiness of news sources and articles. We show that the proposed framework helps ascertain trustworthiness of sources better. We also show that ranking news articles based on trustworthiness learned from the content-driven framework is significantly better than baselines that ignore either the content quality or the trust framework.

【Keywords】: credibility; fact-finders; graph algorithms; trust models

130. Cost-aware travel tour recommendation.

【Paper Link】【Pages】:983-991

【Authors】: Yong Ge ; Qi Liu ; Hui Xiong ; Alexander Tuzhilin ; Jian Chen

【Abstract】: Advances in tourism economics have enabled us to collect massive amounts of travel tour data. If properly analyzed, this data can be a source of rich intelligence for providing real-time decision making and for the provision of travel tour recommendations. However, tour recommendation is quite different from traditional recommendations, because the tourist's choice is directly affected by the travel cost, which includes the financial cost and the time. To that end, in this paper, we provide a focused study of cost-aware tour recommendation. Along this line, we develop two cost-aware latent factor models to recommend travel packages by considering both the travel cost and the tourist's interests. Specifically, we first design a cPMF model, which models the tourist's cost with a 2-dimensional vector. Also, in this cPMF model, the tourist's interests and the travel cost are learnt by exploring travel tour data. Furthermore, in order to model the uncertainty in the travel cost, we further introduce a Gaussian prior into the cPMF model and develop the GcPMF model, where the Gaussian prior is used to express the uncertainty of the travel cost. Finally, experiments on real-world travel tour data show that the cost-aware recommendation models outperform state-of-the-art latent factor models with a significant margin. Also, the GcPMF model with the Gaussian prior can better capture the impact of the uncertainty of the travel cost, and thus performs better than the cPMF model.

【Keywords】: cost-aware recommendation; matrix factorization

131. Discovering highly reliable subgraphs in uncertain graphs.

【Paper Link】【Pages】:992-1000

【Authors】: Ruoming Jin ; Lin Liu ; Charu C. Aggarwal

【Abstract】: In this paper, we investigate the highly reliable subgraph problem, which arises in the context of uncertain graphs. This problem attempts to identify all induced subgraphs for which the probability of connectivity being maintained under uncertainty is higher than a given threshold. This problem arises in a wide range of network applications, such as protein-complex discovery, network routing, and social network analysis. Since exact discovery may be computationally intractable, we introduce a novel sampling scheme which enables approximate discovery of highly reliable subgraphs with high probability. Furthermore, we transform the core mining task into a new frequent cohesive set problem in deterministic graphs. Such transformation enables the development of an efficient two-stage approach which combines novel peeling techniques for maximal set discovery with depth-first search for further enumeration. We demonstrate the effectiveness and efficiency of the proposed algorithms on real and synthetic data sets.

【Keywords】: frequent cohesive set; reliable subgraph; uncertain graph

132. Discovering shakers from evolving entities via cascading graph inference.

【Paper Link】【Pages】:1001-1009

【Authors】: Xiaoxiao Shi ; Wei Fan ; Jianping Zhang ; Philip S. Yu

【Abstract】: In an interconnected and dynamic world, the evolution of one entity may cause a series of significant value changes for some others. For example, the currency inflation of Thailand caused the currency slump of other Asian countries, which eventually led to the financial crisis of 1997. We call such high impact entities shakers. To discover shakers, we first introduce the concept of a cascading graph to capture the causality relationships among evolving entities over some period of time, and then infer shakers from the graph. In a cascading graph, nodes represent entities and weighted links represent the causality effects. In order to find hidden shakers in such a graph, two scoring functions are proposed, each of which estimates how much the target entity can affect the values of some others. The idea is to artificially inject a significant change on the target entity, and estimate its direct and indirect influence on the others, by following an inference rule under the Markovian assumption. Both scoring functions are proven to be only dependent on the structure of a cascading graph and can be calculated in polynomial time. Experiments included three datasets in social sciences. Without directly applicable previous methods, we modified three graphical models as baselines. The two proposed scoring functions can effectively capture those high impact entities. For example, in the experiment to discover stock market shakers, the proposed models outperform the three baselines by as much as 50% in accuracy with the ground truth obtained from Yahoo!~Finance.

【Keywords】: cascading graph; dynamic entity; shakers

133. Discovering spatio-temporal causal interactions in traffic data streams.

【Paper Link】【Pages】:1010-1018

【Authors】: Wei Liu ; Yu Zheng ; Sanjay Chawla ; Jing Yuan ; Xie Xing

【Abstract】: The detection of outliers in spatio-temporal traffic data is an important research problem in the data mining and knowledge discovery community. However to the best of our knowledge, the discovery of relationships, especially causal interactions, among detected traffic outliers has not been investigated before. In this paper we propose algorithms which construct outlier causality trees based on temporal and spatial properties of detected outliers. Frequent substructures of these causality trees reveal not only recurring interactions among spatio-temporal outliers, but potential flaws in the design of existing traffic networks. The effectiveness and strength of our algorithms are validated by experiments on a very large volume of real taxi trajectories in an urban road network.

【Keywords】: frequent sub-structures; outlier causalities; spatio-temporal outliers; urban computing and planning

【Paper Link】【Pages】:1019-1027

【Authors】: Panagiotis Papadimitriou ; Hector Garcia-Molina ; Prabhakar Krishnamurthy ; Randall A. Lewis ; David H. Reiley

【Abstract】: We study the impact of display advertising on user search behavior using a field experiment. In such an experiment, the treatment group users are exposed to some display advertising campaign, while the control group users are not. During the campaign and the post-campaign period we monitor the user search queries and we label them as relevant or irrelevant to the campaign using techniques that leverage the bipartite query-URL click graph. Our results indicate that users who are exposed to the advertising campaign submit 5% to 25% more queries that are relevant to it compared to the unexposed users. Using the social graph of the experiment users, we also explore how users are affected by their friends who are exposed to ads. Our results indicate that a user with exposed friends is more likely to submit queries relevant to the campaign, as compared to a user without exposed friends. The result is surprising given that the display advertising campaign that we study does not include any incentive for social action, e.g., discount for recommending friends.

【Keywords】: ad effectiveness; advertising impact; display ads; field experiment; search lift; social influence

135. Diversified ranking on large graphs: an optimization viewpoint.

【Paper Link】【Pages】:1028-1036

【Authors】: Hanghang Tong ; Jingrui He ; Zhen Wen ; Ravi Konuru ; Ching-Yung Lin

【Abstract】: Diversified ranking on graphs is a fundamental mining task and has a variety of high-impact applications. There are two important open questions here. The first challenge is the measure - how to quantify the goodness of a given top-k ranking list that captures both the relevance and the diversity? The second challenge lies in the algorithmic aspect - how to find an optimal, or near-optimal, top-k ranking list that maximizes the measure we defined in a scalable way? In this paper, we address these challenges from an optimization point of view. Firstly, we propose a goodness measure for a given top-k ranking list. The proposed goodness measure intuitively captures both (a) the relevance between each individual node in the ranking list and the query; and (b) the diversity among different nodes in the ranking list. Moreover, we propose a scalable algorithm (linear wrt the size of the graph) that generates a provably near-optimal solution. The experimental evaluations on real graphs demonstrate its effectiveness and efficiency.

【Keywords】: diversity; graph mining; ranking; scalability

136. Entity disambiguation with hierarchical topic models.

【Paper Link】【Pages】:1037-1045

【Authors】: Saurabh Kataria ; Krishnan S. Kumar ; Rajeev Rastogi ; Prithviraj Sen ; Srinivasan H. Sengamedu

【Abstract】: Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called Wikipedia-based Pachinko Allocation Model} (WPAM) that exploits: (1) All words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia's category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia's crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy.

【Keywords】: disambiguation; entity resolution; topic models

【Paper Link】【Pages】:1046-1054

【Authors】: Salvatore Scellato ; Anastasios Noulas ; Cecilia Mascolo

【Abstract】: Link prediction systems have been largely adopted to recommend new friends in online social networks using data about social interactions. With the soaring adoption of location-based social services it becomes possible to take advantage of an additional source of information: the places people visit. In this paper we study the problem of designing a link prediction system for online location-based social networks. We have gathered extensive data about one of these services, Gowalla, with periodic snapshots to capture its temporal evolution. We study the link prediction space, finding that about 30% of new links are added among "place-friends", i.e., among users who visit the same places. We show how this prediction space can be made 15 times smaller, while still 66% of future connections can be discovered. Thus, we define new prediction features based on the properties of the places visited by users which are able to discriminate potential future links among them. Building on these findings, we describe a supervised learning framework which exploits these prediction features to predict new links among friends-of-friends and place-friends. Our evaluation shows how the inclusion of information about places and related user activity offers high link prediction performance. These results open new directions for real-world link recommendation systems on location-based social networks.

【Keywords】: link prediction; location-based services; social networks

138. Fast approximate similarity search based on degree-reduced neighborhood graphs.

【Paper Link】【Pages】:1055-1063

【Authors】: Kazuo Aoyama ; Kazumi Saito ; Hiroshi Sawada ; Naonori Ueda

【Abstract】: This paper presents a fast approximate similarity search method for finding the most similar object to a given query object from an object set with a dissimilarity with a success probability exceeding a given value. As a search index, the proposed method utilizes a degree-reduced k-nearest neighbor (k-DR) graph constructed from the object set with the dissimilarity, and explores the k-DR graph along its edges using a greedy search (GS) algorithm starting from multiple initial vertices with parallel processing. In the graph-construction stage, the structural parameter k of the k-DR graph is determined so that the probability with which at least one search trial of those with multiple initial vertices succeeds is more than the given success probability. To estimate the greedy-search success probability, we introduce the concept of a basin in the k-DR graph. The experimental results on a real data set verify the approximation scheme and high search performance of the proposed method and demonstrate that it is superior to E2LSH in terms of the expected search cost.

【Keywords】: approximate algorithm; index structure; neighborhood graph; similarity search

139. Fast coordinate descent methods with variable selection for non-negative matrix factorization.

【Paper Link】【Pages】:1064-1072

【Authors】: Cho-Jui Hsieh ; Inderjit S. Dhillon

【Abstract】: Nonnegative Matrix Factorization (NMF) is an effective dimension reduction method for non-negative dyadic data, and has proven to be useful in many areas, such as text mining, bioinformatics and image processing. NMF is usually formulated as a constrained non-convex optimization problem, and many algorithms have been developed for solving it. Recently, a coordinate descent method, called FastHals, has been proposed to solve least squares NMF and is regarded as one of the state-of-the-art techniques for the problem. In this paper, we first show that FastHals has an inefficiency in that it uses a cyclic coordinate descent scheme and thus, performs unneeded descent steps on unimportant variables. We then present a variable selection scheme that uses the gradient of the objective function to arrive at a new coordinate descent method. Our new method is considerably faster in practice and we show that it has theoretical convergence guarantees. Moreover when the solution is sparse, as is often the case in real applications, our new method benefits by selecting important variables to update more often, thus resulting in higher speed. As an example, on a text dataset RCV1, our method is 7 times faster than FastHals, and more than 15 times faster when the sparsity is increased by adding an L1 penalty. We also develop new coordinate descent methods when error in NMF is measured by KL-divergence by applying the Newton method to solve the one-variable sub-problems. Experiments indicate that our algorithm for minimizing the KL-divergence is faster than the Lee & Seung multiplicative rule by a factor of 10 on the CBCL image dataset.

【Keywords】: convergence; coordinate descent method; non-negative matrix factorization

140. Fast locality-sensitive hashing.

【Paper Link】【Pages】:1073-1081

【Authors】: Anirban Dasgupta ; Ravi Kumar ; Tamás Sarlós

【Abstract】: Locality-sensitive hashing (LSH) is a basic primitive in several large-scale data processing applications, including nearest-neighbor search, de-duplication, clustering, etc. In this paper we propose a new and simple method to speed up the widely-used Euclidean realization of LSH. At the heart of our method is a fast way to estimate the Euclidean distance between two d-dimensional vectors; this is achieved by the use of randomized Hadamard transforms in a non-linear setting. This decreases the running time of a (k, L)-parameterized LSH from O(dkL) to O(dlog d + kL). Our experiments show that using the new LSH in nearest-neighbor applications can improve their running times by significant amounts. To the best of our knowledge, this is the first running time improvement to LSH that is both provable and practical.

【Keywords】: dimension reduction; hadamard transform; locality sensitive hashing; nearest neighbour search

【Paper Link】【Pages】:1082-1090

【Authors】: Eunjoon Cho ; Seth A. Myers ; Jure Leskovec

【Abstract】: Even though human movement and mobility patterns have a high degree of freedom and variation, they also exhibit structural patterns due to geographic and social constraints. Using cell phone location data, as well as data from two online location-based social networks, we aim to understand what basic laws govern human motion and dynamics. We find that humans experience a combination of periodic movement that is geographically limited and seemingly random jumps correlated with their social networks. Short-ranged travel is periodic both spatially and temporally and not effected by the social network structure, while long-distance travel is more influenced by social network ties. We show that social relationships can explain about 10% to 30% of all human movement, while periodic behavior explains 50% to 70%. Based on our findings, we develop a model of human mobility that combines periodic short range movements with travel due to the social network structure. We show that our model reliably predicts the locations and dynamics of future human movement and gives an order of magnitude better performance than present models of human mobility.

【Keywords】: communication networks; human mobility; social networks

142. GBASE: a scalable and general graph management system.

【Paper Link】【Pages】:1091-1099

【Authors】: U. Kang ; Hanghang Tong ; Jimeng Sun ; Ching-Yung Lin ; Christos Faloutsos

【Abstract】: Graphs appear in numerous applications including cyber-security, the Internet, social networks, protein networks, recommendation systems, and many more. Graphs with millions or even billions of nodes and edges are common-place. How to store such large graphs efficiently? What are the core operations/queries on those graph? How to answer the graph queries quickly? We propose GBASE, a scalable and general graph management and mining system. The key novelties lie in 1) our storage and compression scheme for a parallel setting and 2) the carefully chosen graph operations and their efficient implementation. We designed and implemented an instance of GBASE using MapReduce/Hadoop. GBASE provides a parallel indexing mechanism for graph mining operations that both saves storage space, as well as accelerates queries. We ran numerous experiments on real graphs, spanning billions of nodes and edges, and we show that our proposed GBASE is indeed fast, scalable and nimble, with significant savings in space and time.

【Keywords】: compression; distributed computing; graph; indexing

【Paper Link】【Pages】:1100-1108

【Authors】: Dashun Wang ; Dino Pedreschi ; Chaoming Song ; Fosca Giannotti ; Albert-László Barabási

【Abstract】: Our understanding of how individual mobility patterns shape and impact the social network is limited, but is essential for a deeper understanding of network dynamics and evolution. This question is largely unexplored, partly due to the difficulty in obtaining large-scale society-wide data that simultaneously capture the dynamical information on individual movements and social interactions. Here we address this challenge for the first time by tracking the trajectories and communication records of 6 Million mobile phone users. We find that the similarity between two individuals' movements strongly correlates with their proximity in the social network. We further investigate how the predictive power hidden in such correlations can be exploited to address a challenging problem: which new links will develop in a social network. We show that mobility measures alone yield surprising predictive power, comparable to traditional network-based measures. Furthermore, the prediction accuracy can be significantly improved by learning a supervised classifier based on combined mobility and network measures. We believe our findings on the interplay of mobility patterns and social ties offer new perspectives on not only link prediction but also network dynamics.

【Keywords】: human mobility; link prediction; social network

144. I want to answer; who has a question?: Yahoo! answers recommender system.

【Paper Link】【Pages】:1109-1117

【Authors】: Gideon Dror ; Yehuda Koren ; Yoelle Maarek ; Idan Szpektor

【Abstract】: Yahoo! Answers is currently one of the most popular question answering systems. We claim however that its user experience could be significantly improved if it could route the "right question" to the "right user." Indeed, while some users would rush answering a question such as "what should I wear at the prom?," others would be upset simply being exposed to it. We argue here that Community Question Answering sites in general and Yahoo! Answers in particular, need a mechanism that would expose users to questions they can relate to and possibly answer. We propose here to address this need via a multi-channel recommender system technology for associating questions with potential answerers on Yahoo! Answers. One novel aspect of our approach is exploiting a wide variety of content and social signals users regularly provide to the system and organizing them into channels. Content signals relate mostly to the text and categories of questions and associated answers, while social signals capture the various user interactions with questions, such as asking, answering, voting, etc. We fuse and generalize known recommendation approaches within a single symmetric framework, which incorporates and properly balances multiple types of signals according to channels. Tested on a large scale dataset, our model exhibits good performance, clearly outperforming standard baselines.

【Keywords】: Yahoo! answers; recommender systems; user models

145. Improving predictions using aggregate information.

【Paper Link】【Pages】:1118-1126

【Authors】: Amit Dhurandhar

【Abstract】: In domains such as consumer products or manufacturing amongst others, we have problems that warrant the prediction of a continuous target. Besides the usual set of explanatory attributes we may also have exact (or approximate) estimates of aggregated targets, which are the sums of disjoint sets of individual targets that we are trying to predict. Hence, the question now becomes can we use these aggregated targets, which are a coarser piece of information, to improve the quality of predictions of the individual targets? In this paper, we provide a simple yet provable way of accomplishing this. In particular, given predictions from any regression model of the target on the test data, we elucidate a provable method for improving these predictions in terms of mean squared error, given exact (or accurate enough) information of the aggregated targets. These estimates of the aggregated targets may be readily available or obtained -- through multilevel regression -- at different levels of granularity. Based on the proof of our method we suggest a criterion for choosing the appropriate level. Moreover, in addition to estimates of the aggregated targets, if we have exact (or approximate) estimates of the mean and variance of the target distribution, then based on our general strategy we provide an optimal way of incorporating this information so as to further improve the quality of predictions of the individual targets. We then validate the results and our claims by conducting experiments on synthetic and real industrial data obtained from diverse domains.

【Keywords】: coarse to fine; hierarchical; regression

146. INCONCO: interpretable clustering of numerical and categorical objects.

【Paper Link】【Pages】:1127-1135

【Authors】: Claudia Plant ; Christian Böhm

【Abstract】: The integrative mining of heterogeneous data and the interpretability of the data mining result are two of the most important challenges of today's data mining. It is commonly agreed in the community that, particularly in the research area of clustering, both challenges have not yet received the due attention. Only few approaches for clustering of objects with mixed-type attributes exist and those few approaches do not consider cluster-specific dependencies between numerical and categorical attributes. Likewise, only a few clustering papers address the problem of interpretability: to explain why a certain set of objects have been grouped into a cluster and what a particular cluster distinguishes from another. In this paper, we approach both challenges by constructing a relationship to the concept of data compression using the Minimum Description Length principle: a detected cluster structure is the better the more efficient it can be exploited for data compression. Following this idea, we can learn, during the run of a clustering algorithm, the optimal trade-off for attribute weights and distinguish relevant attribute dependencies from coincidental ones. We extend the efficient Cholesky decomposition to model dependencies in heterogeneous data and to ensure interpretability. Our proposed algorithm, INCONCO, successfully finds clusters in mixed type data sets, identifies the relevant attribute dependencies, and explains them using linear models and case-by-case analysis. Thereby, it outperforms existing approaches in effectiveness, as our extensive experimental evaluation demonstrates.

【Keywords】: clustering; minimum description length principle; mixed-type data

147. Incorporating SAT solvers into hierarchical clustering algorithms: an efficient and flexible approach.

【Paper Link】【Pages】:1136-1144

【Authors】: Sean Gilpin ; Ian N. Davidson

【Abstract】: The area of constrained clustering has been actively pursued for the last decade. A more recent extension that will be the focus of this paper is constrained hierarchical clustering which allows building user-constrained dendrograms/trees. Like all forms of constrained clustering, previous work on hierarchical constrained clustering uses simple constraints that are typically implemented in a procedural language. However, there exists mature results and packages in the fields of constraint satisfaction languages and solvers that the constrained clustering field has yet to explore. This work marks the first steps towards introducing constraints satisfaction languages/solvers into hierarchical constrained clustering. We make several significant contributions. We show how many existing and new constraints for hierarchical clustering, can be modeled as a Horn-SAT problem that is easily solvable in polynomial time and which allows their implementation in any number of declarative languages or efficient solvers. We implement our own solver for efficiency reasons. We then show how to formulate constrained hierarchical clustering in a flexible manner so that any number of algorithms, whose output is a dendrogram, can make use of the constraints.

【Keywords】: clustering; constrained clustering; hierarchical clustering

148. Latent graphical models for quantifying and predicting patent quality.

【Paper Link】【Pages】:1145-1153

【Authors】: Yan Liu ; Pei-yun Hseuh ; Rick Lawrence ; Steve Meliksetian ; Claudia Perlich ; Alejandro Veen

【Abstract】: The number of patents filed each year has increased dramatically in recent years, raising concerns that patents of questionable validity are restricting the issuance of truly innovative patents. For this reason, there is a strong demand to develop an objective model to quantify patent quality and characterize the attributes that lead to higher-quality patents. In this paper, we develop a latent graphical model to infer patent quality from related measurements. In addition, we extract advanced lexical features via natural language processing techniques to capture the quality measures such as clarity of claims, originality, and importance of cited prior art. We demonstrate the effectiveness of our approach by validating its predictions with previous court decisions of litigated patents.

【Keywords】: graphical models; measurement models; text mining

149. Logical-shapelets: an expressive primitive for time series classification.

【Paper Link】【Pages】:1154-1162

【Authors】: Abdullah Mueen ; Eamonn J. Keogh ; Neal E. Young

【Abstract】: Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant adoption and extension in the community. Despite their immense potential as a data mining primitive, there are two important limitations of shapelets. First, their expressiveness is limited to simple binary presence/absence questions. Second, even though shapelets are computed offline, the time taken to compute them is significant. In this work, we address the latter problem by introducing a novel algorithm that finds shapelets in less time than current methods by an order of magnitude. Our algorithm is based on intelligent caching and reuse of computations, and the admissible pruning of the search space. Because our algorithm is so fast, it creates an opportunity to consider more expressive shapelet queries. In particular, we show for the first time an augmented shapelet representation that distinguishes the data based on conjunctions or disjunctions of shapelets. We call our novel representation Logical-Shapelets. We demonstrate the efficiency of our approach on the classic benchmark datasets used for these problems, and show several case studies where logical shapelets significantly outperform the original shapelet representation and other time series classification techniques. We demonstrate the utility of our ideas in domains as diverse as gesture recognition, robotics, and biometrics.

【Keywords】: classification; decision tree; information gain; time series

150. Meta optimization and its application to portfolio selection.

【Paper Link】【Pages】:1163-1171

【Authors】: Puja Das ; Arindam Banerjee

【Abstract】: Several data mining algorithms use iterative optimization methods for learning predictive models. It is not easy to determine upfront which optimization method will perform best or converge fast for such tasks. In this paper, we analyze Meta Algorithms (MAs) which work by adaptively combining iterates from a pool of base optimization algorithms. We show that the performance of MAs are competitive with the best convex combination of the iterates from the base algorithms for online as well as batch convex optimization problems. We illustrate the effectiveness of MAs on the problem of portfolio selection in the stock market and use several existing ideas for portfolio selection as base algorithms. Using daily S\&P500 data for the past 21 years and a benchmark NYSE dataset, we show that MAs outperform existing portfolio selection algorithms with provable guarantees by several orders of magnitude, and match the performance of the best heuristics in the pool.

【Keywords】: meta optimization; online learning; portfolio selection

151. Mining closed episodes with simultaneous events.

【Paper Link】【Pages】:1172-1180

【Authors】: Nikolaj Tatti ; Boris Cule

【Abstract】: Sequential pattern discovery is a well-studied field in data mining. Episodes are sequential patterns describing events that often occur in the vicinity of each other. Episodes can impose restrictions to the order of the events, which makes them a versatile technique for describing complex patterns in the sequence. Most of the research on episodes deals with special cases such as serial, parallel, and injective episodes, while discovering general episodes is understudied. In this paper we extend the definition of an episode in order to be able to represent cases where events often occur simultaneously. We present an efficient and novel miner for discovering frequent and closed general episodes. Such a task presents unique challenges. Firstly, we cannot define closure based on frequency. We solve this by computing a more conservative closure that we use to reduce the search space and discover the closed episodes as a postprocessing step. Secondly, episodes are traditionally presented as directed acyclic graphs. We argue that this representation has drawbacks leading to redundancy in the output. We solve these drawbacks by defining a subset relationship in such a way that allows us to remove the redundant episodes. We demonstrate the efficiency of our algorithm and the need for using closed episodes empirically on synthetic and real-world datasets.

【Keywords】: closed episodes; depth-first search; frequent episodes

152. Mining mobility data to minimise travellers' spending on public transport.

【Paper Link】【Pages】:1181-1189

【Authors】: Neal Lathia ; Licia Capra

【Abstract】: As the public transport infrastructure of large cities expands, transport operators are diversifying the range and prices of tickets that can be purchased for travel. However, selecting the best fare for each individual traveller's needs is a complex process that is left almost completely unaided. By examining the relation between urban mobility and fare purchasing habits in large datasets from London, England's public transport network, we estimate that travellers in the city cumulatively spend, per year, up to approximately GBP 200 million more than they need to, as a result of purchasing the incorrect fares. We propose to address these incorrect purchases by leveraging the huge volumes of data that travellers create as they move about the city, by providing, to each of them, personalised ticket recommendations based on their estimated future travel patterns. In this work, we explore the viability of building a fare-recommendation system for public transport networks by (a) formalising the problem as two separate prediction problems and (b) evaluating a number of algorithms that aim to match travellers to the best fare. We find that applying data mining techniques to public transport data has the potential to provide travellers with substantial savings.

【Keywords】: filtering; mobility; public transport; recommender systems

153. Mining mobility user profiles for car pooling.

【Paper Link】【Pages】:1190-1198

【Authors】: Roberto Trasarti ; Fabio Pinelli ; Mirco Nanni ; Fosca Giannotti

【Abstract】: In this paper we introduce a methodology for extracting mobility profiles of individuals from raw digital traces (in particular, GPS traces), and study criteria to match individuals based on profiles. We instantiate the profile matching problem to a specific application context, namely proactive car pooling services, and therefore develop a matching criterion that satisfies various basic constraints obtained from the background knowledge of the application domain. In order to evaluate the impact and robustness of the methods introduced, two experiments are reported, which were performed on a massive dataset containing GPS traces of private cars: (i) the impact of the car pooling application based on profile matching is measured, in terms of percentage shareable traffic; (ii) the approach is adapted to coarser-grained mobility data sources that are nowadays commonly available from telecom operators. In addition the ensuing loss in precision and coverage of profile matches is measured.

【Keywords】: spatio-temporal data mining; trajectory pattern

154. Mining partially annotated images.

【Paper Link】【Pages】:1199-1207

【Authors】: Zhongang Qi ; Ming Yang ; Zhongfei (Mark) Zhang ; Zhengyou Zhang

【Abstract】: In this paper, we study the problem of mining partially annotated images. We first define what the problem of mining partially annotated images is, and argue that in many real-world applications annotated images are typically partially annotated and thus that the problem of mining partially annotated images exists in many situations. We then propose an effective solution to this problem based on a statistical model we have developed called the Semi-Supervised Correspondence Hierarchical Dirichlet Process (SSCHDP). The main idea of this model lies in exploiting the information pertaining to partially annotated images or even unannotated images to achieve semi-supervised learning under the HDP structure. We apply this model to completing the annotations appropriately for partially annotated images in the training data and then to predicting the annotations appropriately and completely for all the unannotated images either in the training data or in any unseen data beyond the training process. Experiments show that SSC-HDP is superior to the peer models from the recent literature when they are applied to solving the problem of mining partially annotated images.

【Keywords】: image annotation completion and prediction; partially annotated training set; semi-supervised learning

155. Multi-view transfer learning with a large margin approach.

【Paper Link】【Pages】:1208-1216

【Authors】: Dan Zhang ; Jingrui He ; Yan Liu ; Luo Si ; Richard D. Lawrence

【Abstract】: Transfer learning has been proposed to address the problem of scarcity of labeled data in the target domain by leveraging the data from the source domain. In many real world applications, data is often represented from different perspectives, which correspond to multiple views. For example, a web page can be described by its contents and its associated links. However, most existing transfer learning methods fail to capture the multi-view {nature}, and might not be best suited for such applications. To better leverage both the labeled data from the source domain and the features from different views, {this paper proposes} a general framework: Multi-View Transfer Learning with a Large Margin Approach (MVTL-LM). On one hand, labeled data from the source domain is effectively utilized to construct a large margin classifier; on the other hand, the data from both domains is employed to impose consistencies among multiple views. As an instantiation of this framework, we propose an efficient optimization method, which is guaranteed to converge to ε precision in O(1/ε) steps. Furthermore, we analyze its error bound, which improves over existing results of related methods. An extensive set of experiments are conducted to demonstrate the advantages of our proposed method over state-of-the-art techniques.

【Keywords】: large margin approach; multi-view learning; transfer learning

156. MultiRank: co-ranking for objects and relations in multi-relational data.

【Paper Link】【Pages】:1217-1225

【Authors】: Michaek Kwok-Po Ng ; Xutao Li ; Yunming Ye

【Abstract】: The main aim of this paper is to design a co-ranking scheme for objects and relations in multi-relational data. It has many important applications in data mining and information retrieval. However, in the literature, there is a lack of a general framework to deal with multi-relational data for co-ranking. The main contribution of this paper is to (i) propose a framework (MultiRank) to determine the importance of both objects and relations simultaneously based on a probability distribution computed from multi-relational data; (ii) show the existence and uniqueness of such probability distribution so that it can be used for co-ranking for objects and relations very effectively; and (iii) develop an efficient iterative algorithm to solve a set of tensor (multivariate polynomial) equations to obtain such probability distribution. Extensive experiments on real-world data suggest that the proposed framework is able to provide a co-ranking scheme for objects and relations successfully. Experimental results have also shown that our algorithm is computationally efficient, and effective for identification of interesting and explainable co-ranking results.

【Keywords】: multi-relational data; ranking; rectangular tensors; stationary probability distribution; transition probability tensors

157. On dynamic data-driven selection of sensor streams.

【Paper Link】【Pages】:1226-1234

【Authors】: Charu C. Aggarwal ; Yan Xie ; Philip S. Yu

【Abstract】: Sensor nodes have limited local storage, computational power, and battery life, as a result of which it is desirable to minimize the storage, processing and communication from these nodes during data collection. The problem is further magnified by the large volumes of data collected. In real applications, sensor streams are often highly correlated with one another or may have other kinds of functional dependencies. For example, a group of sound sensors in a given geographical proximity may pick almost the same set of signals. Clearly, since there are considerable functional dependencies between different sensors, there are huge redundancies in the data collected by sensors. These redundancies may also change as the data evolve over time. In this paper, we discuss real time algorithms for reducing the volume of the data collected in sensor networks. The broad idea is to determine the functional dependencies between sensor streams efficiently in real time, and actively collect the data only from a minimal set of sensors. The remaining sensors collect the data passively at low sampling rates in order to detect any changing trends in the underlying data. We present real time algorithms in order to minimize the power consumption in reducing the data collected and show that the resulting data retains almost the same amount of information at a much lower cost.

【Keywords】: data streams; sensor selection

158. On the privacy of anonymized networks.

【Paper Link】【Pages】:1235-1243

【Authors】: Pedram Pedarsani ; Matthias Grossglauser

【Abstract】: The proliferation of online social networks, and the concomitant accumulation of user data, give rise to hotly debated issues of privacy, security, and control. One specific challenge is the sharing or public release of anonymized data without accidentally leaking personally identifiable information (PII). Unfortunately, it is often difficult to ascertain that sophisticated statistical techniques, potentially employing additional external data sources, are unable to break anonymity. In this paper, we consider an instance of this problem, where the object of interest is the structure of a social network, i.e., a graph describing users and their links. Recent work demonstrates that anonymizing node identities may not be sufficient to keep the network private: the availability of node and link data from another domain, which is correlated with the anonymized network, has been used to re-identify the anonymized nodes. This paper is about conditions under which such a de-anonymization process is possible. We attempt to shed light on the following question: can we assume that a sufficiently sparse network is inherently anonymous, in the sense that even with unlimited computational power, de-anonymization is impossible? Our approach is to introduce a random graph model for a version of the de-anonymization problem, which is parameterized by the expected node degree and a similarity parameter that controls the correlation between two graphs over the same vertex set. We find simple conditions on these parameters delineating the boundary of privacy, and show that the mean node degree need only grow slightly faster than log n with network size n for nodes to be identifiable. Our results have policy implications for sharing of anonymized network information.

【Keywords】: de-anonymization; graph sampling; network privacy; random graphs; social networks

159. Ontology enhancement and concept granularity learning: keeping yourself current and adaptive.

【Paper Link】【Pages】:1244-1252

【Authors】: Shan Jiang ; Lidong Bing ; Bai Sun ; Yan Zhang ; Wai Lam

【Abstract】: As a well-known semantic repository, WordNet is widely used in many applications. However, due to costly edit and maintenance, WordNet's capability of keeping up with the emergence of new concepts is poor compared with on-line encyclopedias such as Wikipedia. To keep WordNet current with folk wisdom, we propose a method to enhance WordNet automatically by merging Wikipedia entities into WordNet, and construct an enriched ontology, named as WorkiNet. WorkiNet keeps the desirable structure of WordNet. At the same time, it captures abundant information from Wikipedia. We also propose a learning approach which is able to generate a tailor-made semantic concept collection for a given document collection. The learning process takes the characteristics of the given document collection into consideration and the semantic concepts in the tailor-made collection can be used as new features for document representation. The experimental results show that the adaptively generated feature space can outperform a static one significantly in text mining tasks, and WorkiNet dominates WordNet most of the time due to its high coverage.

【Keywords】: ontology; tailor-made concept representation learning; workinet

160. Personal privacy vs population privacy: learning to attack anonymization.

【Paper Link】【Pages】:1253-1261

【Authors】: Graham Cormode

【Abstract】: Over the last decade great strides have been made in developing techniques to compute functions privately. In particular, Differential Privacy gives strong promises about conclusions that can be drawn about an individual. In contrast, various syntactic methods for providing privacy (criteria such as k-anonymity and l-diversity) have been criticized for still allowing private information of an individual to be inferred. In this paper, we consider the ability of an attacker to use data meeting privacy definitions to build an accurate classifier. We demonstrate that even under Differential Privacy, such classifiers can be used to infer "private" attributes accurately in realistic data. We compare this to similar approaches for inference-based attacks on other forms of anonymized data. We show how the efficacy of all these attacks can be measured on the same scale, based on the probability of successfully inferring a private attribute. We observe that the accuracy of inference of private attributes for differentially private data and $l$-diverse data can be quite similar.

【Keywords】: anonymization; differential privacy

【Paper Link】【Pages】:1262-1270

【Authors】: Chih-Hua Tai ; Philip S. Yu ; De-Nian Yang ; Ming-Syan Chen

【Abstract】: Due to the rich information in graph data, the technique for privacy protection in published social networks is still in its infancy, as compared to the protection in relational databases. In this paper we identify a new type of attack called a friendship attack. In a friendship attack, an adversary utilizes the degrees of two vertices connected by an edge to re-identify related victims in a published social network data set. To protect against such attacks, we introduce the concept of k2-degree anonymity, which limits the probability of a vertex being re-identified to 1/k. For the k2-degree anonymization problem, we propose an Integer Programming formulation to find optimal solutions in small-scale networks. We also present an efficient heuristic approach for anonymizing large-scale social networks against friendship attacks. The experimental results demonstrate that the proposed approaches can preserve much of the characteristics of social networks.

【Keywords】: anonymization; privacy; social network

162. Probabilistic topic models with biased propagation on heterogeneous information networks.

【Paper Link】【Pages】:1271-1279

【Authors】: Hongbo Deng ; Jiawei Han ; Bo Zhao ; Yintao Yu ; Cindy Xide Lin

【Abstract】: With the development of Web applications, textual documents are not only getting richer, but also ubiquitously interconnected with users and other objects in various ways, which brings about text-rich heterogeneous information networks. Topic models have been proposed and shown to be useful for document analysis, and the interactions among multi-typed objects play a key role at disclosing the rich semantics of the network. However, most of topic models only consider the textual information while ignore the network structures or can merely integrate with homogeneous networks. None of them can handle heterogeneous information network well. In this paper, we propose a novel topic model with biased propagation (TMBP) algorithm to directly incorporate heterogeneous information network with topic modeling in a unified way. The underlying intuition is that multi-typed objects should be treated differently along with their inherent textual information and the rich semantics of the heterogeneous information network. A simple and unbiased topic propagation across such a heterogeneous network does not make much sense. Consequently, we investigate and develop two biased propagation frameworks, the biased random walk framework and the biased regularization framework, for the TMBP algorithm from different perspectives, which can discover latent topics and identify clusters of multi-typed objects simultaneously. We extensively evaluate the proposed approach and compare to the state-of-the-art techniques on several datasets. Experimental results demonstrate that the improvement in our proposed approach is consistent and promising.

【Keywords】: biased propagation; clustering; heterogeneous information network; topic modeling

163. Prominent streak discovery in sequence data.

【Paper Link】【Pages】:1280-1288

【Authors】: Xiao Jiang ; Chengkai Li ; Ping Luo ; Min Wang ; Yong Yu

【Abstract】: This paper studies the problem of prominent streak discovery in sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (small) values. For finding prominent streaks, we make the observation that prominent streaks are skyline points in two dimensions- streak interval length and minimum value in the interval. Our solution thus hinges upon the idea to separate the two steps in prominent streak discovery' candidate streak generation and skyline operation over candidate streaks. For candidate generation, we propose the concept of local prominent streak (LPS). We prove that prominent streaks are a subset of LPSs and the number of LPSs is less than the length of a data sequence, in comparison with the quadratic number of candidates produced by a brute-force baseline method. We develop efficient algorithms based on the concept of LPS. The non-linear LPS-based method (NLPS) considers a superset of LPSs as candidates, and the linear LPS-based method (LLPS) further guarantees to consider only LPSs. The results of experiments using multiple real datasets verified the effectiveness of the proposed methods and showed orders of magnitude performance improvement against the baseline method.

【Keywords】: sequence database; skyline query; time-series database

164. Protecting location privacy using location semantics.

【Paper Link】【Pages】:1289-1297

【Authors】: Byoungyoung Lee ; Jinoh Oh ; Hwanjo Yu ; Jong Kim

【Abstract】: As the use of mobile devices increases, a location-based service (LBS) becomes increasingly popular because it provides more convenient context-aware services. However, LBS introduces problematic issues for location privacy due to the nature of the service. Location privacy protection methods based on k-anonymity and l-diversity have been proposed to provide anonymized use of LBS. However, the k-anonymity and l-diversity methods still can endanger the user's privacy because location semantic information could easily be breached while using LBS. This paper presents a novel location privacy protection technique, which protects the location semantics from an adversary. In our scheme, location semantics are first learned from location data. Then, the trusted-anonymization server performs the anonymization using the location semantic information by cloaking with semantically heterogeneous locations. Thus, the location semantic information is kept secure as the cloaking is done with semantically heterogeneous locations and the true location information is not delivered to the LBS applications. This paper proposes algorithms for learning location semantics and achieving semantically secure cloaking.

【Keywords】: location privacy; location semantics; theta-secure cloaking area

165. Ranking-based classification of heterogeneous information networks.

【Paper Link】【Pages】:1298-1306

【Authors】: Ming Ji ; Jiawei Han ; Marina Danilevsky

【Abstract】: It has been recently recognized that heterogeneous information networks composed of multiple types of nodes and links are prevalent in the real world. Both classification and ranking of the nodes (or data objects) in such networks are essential for network analysis. However, so far these approaches have generally been performed separately. In this paper, we combine ranking and classification in order to perform more accurate analysis of a heterogeneous information network. Our intuition is that highly ranked objects within a class should play more important roles in classification. On the other hand, class membership information is important for determining a quality ranking over a dataset. We believe it is therefore beneficial to integrate classification and ranking in a simultaneous, mutually enhancing process, and to this end, propose a novel ranking-based iterative classification framework, called RankClass. Specifically, we build a graph-based ranking model to iteratively compute the ranking distribution of the objects within each class. At each iteration, according to the current ranking results, the graph structure used in the ranking algorithm is adjusted so that the sub-network corresponding to the specific class is emphasized, while the rest of the network is weakened. As our experiments show, integrating ranking with classification not only generates more accurate classes than the state-of-art classification methods on networked data, but also provides meaningful ranking of objects within each class, serving as a more informative view of the data than traditional classification.

【Keywords】: classification; heterogeneous information network; ranking

166. Real-time bidding algorithms for performance-based display ad allocation.

【Paper Link】【Pages】:1307-1315

【Authors】: Ye Chen ; Pavel Berkhin ; Bo Anderson ; Nikhil R. Devanur

【Abstract】: We describe a real-time bidding algorithm for performance-based display ad allocation. A central issue in performance display advertising is matching campaigns to ad impressions, which can be formulated as a constrained optimization problem that maximizes revenue subject to constraints such as budget limits and inventory availability. The current practice is to solve the optimization problem offline at a tractable level of impression granularity (e.g., the page level), and to serve ads online based on the precomputed static delivery scheme. Although this offline approach takes a global view to achieve optimality, it fails to scale to ad allocation at the individual impression level. Therefore, we propose a real-time bidding algorithm that enables fine-grained impression valuation (e.g., targeting users with real-time conversion data), and adjusts value-based bids according to real-time constraint snapshots (e.g., budget consumption levels). Theoretically, we show that under a linear programming (LP) primal-dual formulation, the simple real-time bidding algorithm is indeed an online solver to the original primal problem by taking the optimal solution to the dual problem as input. In other words, the online algorithm guarantees the offline optimality given the same level of knowledge an offline optimization would have. Empirically, we develop and experiment with two real-time bid adjustment approaches to adapting to the non-stationary nature of the marketplace: one adjusts bids against real-time constraint satisfaction levels using control-theoretic methods, and the other adjusts bids also based on the statistically modeled historical bidding landscape. Finally, we show experimental results with real-world ad delivery data that support our theoretical conclusions.

【Keywords】: ad exchange; combinatorial optimization; linear programming; performance display; real-time bidding

167. Revisiting sequential pattern hiding to enhance utility.

【Paper Link】【Pages】:1316-1324

【Authors】: Aris Gkoulalas-Divanis ; Grigorios Loukides

【Abstract】: Sequence datasets are encountered in a plethora of applications spanning from web usage analysis to healthcare studies and ubiquitous computing. Disseminating such datasets offers remarkable opportunities for discovering interesting knowledge patterns, but may lead to serious privacy violations if sensitive patterns, such as business secrets, are disclosed. In this work, we consider how to sanitize data to prevent the disclosure of sensitive patterns during sequential pattern mining, while ensuring that the nonsensitive patterns can still be discovered. First, we re-define the problem of sequential pattern hiding to capture the information loss incurred by sanitization in terms of both events' modification (distortion) and lost nonsensitive knowledge patterns (side-effects). Second, we model sequences as graphs and propose two algorithms to solve the problem by operating on the graphs. The first algorithm attempts to sanitize data with minimal distortion, whereas the second focuses on reducing the side-effects. Extensive experiments show that our algorithms outperform the existing solution in terms of data distortion and side-effects and are more efficient.

【Keywords】: data privacy; knowledge hiding; sequential pattern hiding

168. Sampling hidden objects using nearest-neighbor oracles.

【Paper Link】【Pages】:1325-1333

【Authors】: Nilesh N. Dalvi ; Ravi Kumar ; Ashwin Machanavajjhala ; Vibhor Rastogi

【Abstract】: Given an unknown set of objects embedded in the Euclidean plane and a nearest-neighbor oracle, how to estimate the set size and other properties of the objects? In this paper we address this problem. We propose an efficient method that uses the Voronoi partitioning of the space by the objects and a nearest-neighbor oracle. Our method can be used in the hidden web/databases context where the goal is to estimate the number of certain objects of interest. Here, we assume that each object has a geographic location and the nearest-neighbor oracle can be realized by applications such as maps, local, or store-locator APIs. We illustrate the performance of our method on several real-world datasets.

【Keywords】: nearest-neighbors; sampling; size estimation; voronoi cell

169. Scalable kNN search on vertically stored time series.

【Paper Link】【Pages】:1334-1342

【Authors】: Shrikant Kashyap ; Panagiotis Karras

【Abstract】: Nearest-neighbor search over time series has received vast research attention as a basic data mining task. Still, none of the hitherto proposed methods scales well with increasing time-series length. This is due to the fact that all methods provide an one-off pruning capacity only. In particular, traditional methods utilize an index to search in a reduced-dimensionality feature space; however, for high time-series length, search with such an index yields many false hits that need to be eliminated by accessing the full records. An attempt to reduce false hits by indexing more features exacerbates the curse of dimensionality, and vice versa. A recently proposed alternative, iSAX, uses symbolic approximate representations accessed by a simple file-system directory as an index. Still, iSAX also encounters false hits, which are again eliminated by accessing records in full: once a false hit is generated by the index, there is no second chance to prune it; thus, the pruning capacity iSAX provides is also one-off. This paper proposes an alternative approach to time series kNN search, following a nontraditional pruning style. Instead of navigating through candidate records via an index, we access their features, obtained by a multi-resolution transform, in a stepwise sequential-scan manner, one level of resolution at a time, over a vertical representation. Most candidates are progressively eliminated after a few of their terms are accessed, using pre-computed information and an unprecedentedly tight double-bounding scheme, involving not only lower, but also upper distance bounds. Our experimental study with large, high-length time-series data confirms the advantage of our approach over both the current state-of-the-art method, iSAX, and classical index-based methods.

【Keywords】: nearest neighbor; similarity search; time series

170. Serendipitous learning: learning beyond the predefined label space.

【Paper Link】【Pages】:1343-1351

【Authors】: Dan Zhang ; Yan Liu ; Luo Si

【Abstract】: Most traditional supervised learning methods are developed to learn a model from labeled examples and use this model to classify the unlabeled ones into the same label space predefined by the models. However, in many real world applications, the label spaces for both the labeled/training and unlabeled/testing examples can be different. To solve this problem, this paper proposes a novel notion of Serendipitous Learning (SL), which is defined to address the learning scenarios in which the label space can be enlarged during the testing phase. In particular, a large margin approach is proposed to solve SL. The basic idea is to leverage the knowledge in the labeled examples to help identify novel/unknown classes, and the large margin formulation is proposed to incorporate both the classification loss on the examples within the known categories, as well as the clustering loss on the examples in unknown categories. An efficient optimization algorithm based on CCCP and the bundle method is proposed to solve the optimization problem of the large margin formulation of SL. Moreover, an efficient online learning method is proposed to address the issue of large scale data in online learning scenario, which has been shown to have a guaranteed learning regret. An extensive set of experimental results on two synthetic datasets and two datasets from real world applications demonstrate the advantages of the proposed method over several other baseline algorithms. One limitation of the proposed method is that the number of unknown classes is given in advance. It may be possible to remove this constraint if we model it by using a non-parametric way. We also plan to do experiments on more real world applications in the future.

【Keywords】: label space; maximum margin classification; serendipitous learning

171. Spatially regularized logistic regression for disease mapping on large moving populations.

【Paper Link】【Pages】:1352-1360

【Authors】: Vuk Malbasa ; Slobodan Vucetic

【Abstract】: Spatial analysis of disease risk, or disease mapping, typically relies on information about the residence and health status of individuals from population under study. However, residence information has its limitations because people are exposed to numerous disease risks as they spend time outside of their residences. Thanks to the wide-spread use of mobile phones and GPS-enabled devices, it is becoming possible to obtain a detailed record about the movement of human populations. Availability of movement information opens up an opportunity to improve the accuracy of disease mapping. Starting with an assumption that an individual's disease risk is a weighted average of risks at the locations which were visited, we show that disease mapping can be accomplished by spatially regularized logistic regression. Due to the inherent sparsity of movement data, the proposed approach can be applied to large populations and over large spatial grids. In our experiments, we were able to map disease for a simulated population with 1.6 million people and a spatial grid with 65 thousand locations in several minutes. The results indicate that movement information can improve the accuracy of disease mapping as compared to residential data only. We also studied a privacy-preserving scenario in which only the aggregate statistics are available about the movement of the overall population, while detailed movement information is available only for individuals with disease. The results indicate that the accuracy of disease mapping remains satisfactory when learning from movement data sanitized in this way.

【Keywords】: disease mapping; movement trajectories; privacy; regularization; spatial epidemiology; spatial-temporal data mining

172. Temporal multi-hierarchy smoothing for estimating rates of rare events.

【Paper Link】【Pages】:1361-1369

【Authors】: Nagaraj Kota ; Deepak Agarwal

【Abstract】: We consider the problem of estimating rates of rare events obtained through interactions among several categorical variables that are heavy-tailed and hierarchical. In our previous work, we proposed a scalable log-linear model called LMMH (Log-Linear Models for Multiple Hierarchies) that combats data sparsity at granular levels through small sample size corrections that borrow strength from rate estimates at coarser resolutions. This paper extends our previous work in two directions. First, we model excess heterogeneity by fitting local LMMH models to relatively homogeneous subsets of the data. To ensure scalable computation, these subsets are induced through a decision tree, we call this Treed-LMMH. Second, the Treed-LMMH method is coupled with temporal smoothing procedure based on a fast Kalman filter style algorithm. We show that simultaneously performing hierarchical and temporal smoothing leads to significant improvement in predictive accuracy. Our methods are illustrated on a large scale computational advertising dataset consisting of billions of observations and hundreds of millions of attribute combinations(cells).

【Keywords】: computational advertising; count data; decision trees; display advertising; kalman filtering; multi-hierarchy smoother

173. ThermoCast: a cyber-physical forecasting model for datacenters.

【Paper Link】【Pages】:1370-1378

【Authors】: Lei Li ; Chieh-Jan Mike Liang ; Jie Liu ; Suman Nath ; Andreas Terzis ; Christos Faloutsos

【Abstract】: Efficient thermal management is important in modern data centers as cooling consumes up to 50% of the total energy. Unlike previous work, we consider proactive thermal management, whereby servers can predict potential overheating events due to dynamics in data center configuration and workload, giving operators enough time to react. However, such forecasting is very challenging due to data center scales and complexity. Moreover, such a physical system is influenced by cyber effects, including workload scheduling in servers. We propose ThermoCast, a novel thermal forecasting model to predict the temperatures surrounding the servers in a data center, based on continuous streams of temperature and airflow measurements. Our approach is (a) capable of capturing cyberphysical interactions and automatically learning them from data; (b) computationally and physically scalable to data center scales; (c) able to provide online prediction with real-time sensor measurements. The paper's main contributions are: (i) We provide a systematic approach to integrate physical laws and sensor observations in a data center; (ii) We provide an algorithm that uses sensor data to learn the parameters of a data center's cyber-physical system. In turn, this ability enables us to reduce model complexity compared to full-fledged fluid dynamics models, while maintaining forecast accuracy; (iii) Unlike previous simulation-based studies, we perform experiments in a production data center. Using real data traces, we show that ThermoCast forecasts temperature better than a machine learning approach solely driven by data, and can successfully predict thermal alarms 4.2 minutes ahead of time.

【Keywords】: cyber physical modeling; data center energy efficiency; time series forecasting

174. Towards bounding sequential patterns.

【Paper Link】【Pages】:1379-1387

【Authors】: Chedy Raïssi ; Jian Pei

【Abstract】: Given a sequence database, can we have a non-trivial upper bound on the number of sequential patterns? The problem of bounding sequential patterns is very challenging in theory due to the combinatorial complexity of sequences, even given some inspiring results on bounding itemsets in frequent itemset mining. Moreover, the problem is highly meaningful in practice, since the upper bound can be used in many applications such as space allocation in building sequence data warehouses. In this paper, we tackle the problem of bounding sequential patterns by presenting, for the first time in the field of sequential pattern mining, strong combinatorial results on computing the number of possible sequential patterns that can be generated at a given length k. We introduce, as a case study, two novel techniques to estimate the number of candidate sequences. An extensive empirical study on both real data and synthetic data verifies the effectiveness of our methods.

【Keywords】: combinatorics; sequential pattern mining

175. User-click modeling for understanding and predicting search-behavior.

【Paper Link】【Pages】:1388-1396

【Authors】: Yuchen Zhang ; Weizhu Chen ; Dong Wang ; Qiang Yang

【Abstract】: Recent advances in search users' click modeling consider both users' search queries and click/skip behavior on documents to infer the user's perceived relevance. Most of these models, including dynamic Bayesian networks (DBN) and user browsing models (UBM), use probabilistic models to understand user click behavior based on individual queries. The user behavior is more complex when her actions to satisfy her information needs form a search session, which may include multiple queries and subsequent click behaviors on various items on search result pages. Previous research is limited to treating each query within a search session in isolation, without paying attention to their dynamic interactions with other queries in a search session. Investigating this problem, we consider the sequence of queries and their clicks in a search session as a task and propose a task-centric click model~(TCM). TCM characterizes user behavior related to a task as a collective whole. Specifically, we identify and consider two new biases in TCM as the basis for user modeling. The first indicates that users tend to express their information needs incrementally in a task, and thus perform more clicks as their needs become clearer. The other illustrates that users tend to click fresh documents that are not included in the results of previous queries. Using these biases, TCM is more accurately able to capture user search behavior. Extensive experimental results demonstrate that by considering all the task information collectively, TCM can better interpret user click behavior and achieve significant improvements in terms of ranking metrics of NDCG and perplexity.

【Keywords】: click log analysis; task-centric click model

【Paper Link】【Pages】:1397-1405

【Authors】: Chenhao Tan ; Lillian Lee ; Jie Tang ; Long Jiang ; Ming Zhou ; Ping Li

【Abstract】: We show that information about social relationships can be used to improve user-level sentiment analysis. The main motivation behind our approach is that users that are somehow "connected" may be more likely to hold similar opinions; therefore, relationship information can complement what we can extract about a user's viewpoints from their utterances. Employing Twitter as a source for our experimental data, and working within a semi-supervised framework, we propose models that are induced either from the Twitter follower/followee network or from the network in Twitter formed by users referring to each other using "@" mentions. Our transductive learning results reveal that incorporating social-network information can indeed lead to statistically significant sentiment classification improvements over the performance of an approach based on Support Vector Machines having access only to textual features.

【Keywords】: opinion mining; sentiment analysis; social networks; twitter

177. Web information extraction using markov logic networks.

【Paper Link】【Pages】:1406-1414

【Authors】: Sandeepkumar Satpal ; Sahely Bhadra ; Sundararajan Sellamanickam ; Rajeev Rastogi ; Prithviraj Sen

【Abstract】: In this paper, we consider the problem of extracting structured data from web pages taking into account both the content of individual attributes as well as the structure of pages and sites. We use Markov Logic Networks (MLNs) to capture both content and structural features in a single unified framework, and this enables us to perform more accurate inference. MLNs allow us to model a wide range of rich structural features like proximity, precedence, alignment, and contiguity, using first-order clauses. We show that inference in our information extraction scenario reduces to solving an instance of the maximum weight subgraph problem. We develop specialized procedures for solving the maximum subgraph variants that are far more efficient than previously proposed inference methods for MLNs that solve variants of MAX-SAT. Experiments with real-life datasets demonstrate the effectiveness of our MLN-based approach compared to existing state-of-the-art extraction methods.

【Keywords】: Markov logic networks; information extraction; machine learned models; probabilistic models