30. ICML 2013:Atlanta, GA, USA

Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013. JMLR.org 【DBLP Link

Paper Num: 283 || Session Num: 3

Cycle 1 Papers 74

1. An Optimal Policy for Target Localization with Application to Electron Microscopy.

Paper Link】 【Pages】:1-9

【Authors】: Raphael Sznitman ; Aurélien Lucchi ; Peter I. Frazier ; Bruno Jedynak ; Pascal Fua

【Abstract】: This paper considers the task of finding a target location by making a limited number of sequential observations. Each observation results from evaluating an imperfect classifier of a chosen cost and accuracy on an interval of chosen length and position. Within a Bayesian framework, we study the problem of minimizing an objective that combines the entropy of the posterior distribution with the cost of the questions asked. In this problem, we show that the one-step lookahead policy is Bayes-optimal for any arbitrary time horizon. Moreover, this one-step lookahead policy is easy to compute and implement. We then use this policy in the context of localizing mitochondria in electron microscope images, and experimentally show that significant speed ups in acquisition can be gained, while maintaining near equal image quality at target locations, when compared to current policies.

【Keywords】:

2. Domain Generalization via Invariant Feature Representation.

Paper Link】 【Pages】:10-18

【Authors】: Krikamol Muandet ; David Balduzzi ; Bernhard Schölkopf

【Abstract】: This paper investigates domain generalization: How to take knowledge acquired from an arbitrary number of related domains and apply it to previously unseen domains? We propose Domain-Invariant Component Analysis (DICA), a kernel-based optimization algorithm that learns an invariant transformation by minimizing the dissimilarity across domains, whilst preserving the functional relationship between input and output variables. A learning-theoretic analysis shows that reducing dissimilarity improves the expected generalization ability of classifiers on new domains, motivating the proposed algorithm. Experimental results on synthetic and real-world datasets demonstrate that DICA successfully learns invariant features and improves classifier performance in practice.

【Keywords】:

3. A Spectral Learning Approach to Range-Only SLAM.

Paper Link】 【Pages】:19-26

【Authors】: Byron Boots ; Geoffrey J. Gordon

【Abstract】: We present a novel spectral learning algorithm for simultaneous localization and mapping (SLAM) from range data with known correspondences. This algorithm is an instance of a general spectral system identification framework, from which it inherits several desirable properties, including statistical consistency and no local optima. Compared with popular batch optimization or multiple-hypothesis tracking (MHT) methods for range-only SLAM, our spectral approach offers guaranteed low computational requirements and good tracking performance. Compared with MHT and with popular extended Kalman filter (EKF) or extended information filter (EIF) approaches, our approach does not need to linearize a transition or measurement model. We provide a theoretical analysis of our method, including finite-sample error bounds. Finally, we demonstrate on a real-world robotic SLAM problem that our algorithm is not only theoretically justified, but works well in practice: in a comparison of multiple methods, the lowest errors come from a combination of our algorithm with batch optimization, but our method alone produces nearly as good a result at far lower computational cost.

【Keywords】:

4. Near-Optimal Bounds for Cross-Validation via Loss Stability.

Paper Link】 【Pages】:27-35

【Authors】: Ravi Kumar ; Daniel Lokshtanov ; Sergei Vassilvitskii ; Andrea Vattani

【Abstract】: Multi-fold cross-validation is an established practice to estimate the error rate of a learning algorithm. Quantifying the variance reduction gains due to cross-validation has been challenging due to the inherent correlations introduced by the folds. In this work we introduce a new and weak measure of stability called loss stability and relate the cross-validation performance to loss stability; we also establish that this relationship is near-optimal. Our work thus quantitatively improves the current best bounds on cross-validation.

【Keywords】:

5. Sparsity-Based Generalization Bounds for Predictive Sparse Coding.

Paper Link】 【Pages】:36-44

【Authors】: Nishant Ajay Mehta ; Alexander G. Gray

【Abstract】: The goal of predictive sparse coding is to learn a representation of examples as sparse linear combinations of elements from a dictionary, such that a learned hypothesis linear in the new representation performs well on a predictive task. Predictive sparse coding has demonstrated impressive performance on a variety of supervised tasks, but its generalization properties have not been studied. We establish the first generalization error bounds for predictive sparse coding, in the overcomplete setting, where the number of features k exceeds the original dimensionality d. The learning bound decays as (sqrt(d k/m)) with respect to d, k, and the size m of the training sample. It depends intimately on stability properties of the learned sparse encoder, as measured on the training sample. Consequently, we also present a fundamental stability result for the LASSO, a result that characterizes the stability of the sparse codes with respect to dictionary perturbations.

【Keywords】:

6. Sparse Uncorrelated Linear Discriminant Analysis.

Paper Link】 【Pages】:45-52

【Authors】: Xiaowei Zhang ; Delin Chu

【Abstract】: In this paper, we develop a novel approach for sparse uncorrelated linear discriminant analysis (ULDA). Our proposal is based on characterization of all solutions of the generalized ULDA. We incorporate sparsity into the ULDA transformation by seeking the solution with minimum (\ell1)-norm from all minimum dimension solutions of the generalized ULDA. The problem is then formulated as a (\ell{1})-minimization problem and is solved by accelerated linearized Bregman method. Experiments on high-dimensional gene expression data demonstrate that our approach not only computes extremely sparse solutions but also performs well in classification. Experimental results also show that our approach can help for data visualization in low-dimensional space.

【Keywords】:

7. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs.

Paper Link】 【Pages】:53-61

【Authors】: Simon Lacoste-Julien ; Martin Jaggi ; Mark W. Schmidt ; Patrick Pletscher

【Abstract】: We propose a randomized block-coordinate variant of the classic Frank-Wolfe algorithm for convex optimization with block-separable constraints. Despite its lower iteration cost, we show that it achieves a similar convergence rate in duality gap as the full Frank-Wolfe algorithm. We also show that, when applied to the dual structural support vector machine (SVM) objective, this yields an online algorithm that has the same low iteration complexity as primal stochastic subgradient methods. However, unlike stochastic subgradient methods, the block-coordinate Frank-Wolfe algorithm allows us to compute the optimal step-size and yields a computable duality gap guarantee. Our experiments indicate that this simple algorithm outperforms competing structural SVM solvers.

【Keywords】:

8. Fast Probabilistic Optimization from Noisy Gradients.

Paper Link】 【Pages】:62-70

【Authors】: Philipp Hennig

【Abstract】: Stochastic gradient descent remains popular in large-scale machine learning, on account of its very low computational cost and robustness to noise. However, gradient descent is only linearly efficient and not transformation invariant. Scaling by a local measure can substantially improve its performance. One natural choice of such a scale is the Hessian of the objective function: Were it available, it would turn linearly efficient gradient descent into the quadratically efficient Newton-Raphson optimization. Existing covariant methods, though, are either super-linearly expensive or do not address noise. Generalising recent results, this paper constructs a nonparametric Bayesian quasi-Newton algorithm that learns gradient and Hessian from noisy evaluations of the gradient. Importantly, the resulting algorithm, like stochastic gradient descent, has cost linear in the number of input dimensions.

【Keywords】:

9. Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes.

Paper Link】 【Pages】:71-79

【Authors】: Ohad Shamir ; Tong Zhang

【Abstract】: Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required non-trivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions such as support vector machines. In this paper, we investigate the performance of SGD without such smoothness assumptions, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy. In this framework, we prove that after (T) rounds, the suboptimality of the last SGD iterate scales as (O(\log(T)/\sqrt{T})) for non-smooth convex objective functions, and (O(\log(T)/T)) in the non-smooth strongly convex case. To the best of our knowledge, these are the first bounds of this kind, and almost match the minimax-optimal rates obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, which not only attains optimal rates, but can also be easily computed on-the-fly (in contrast, the suffix averaging scheme proposed in is not as simple to implement). Finally, we provide some experimental illustrations.

【Keywords】:

10. Stochastic Alternating Direction Method of Multipliers.

Paper Link】 【Pages】:80-88

【Authors】: Hua Ouyang ; Niao He ; Long Tran ; Alexander G. Gray

【Abstract】: The Alternating Direction Method of Multipliers (ADMM) has received lots of attention recently due to the tremendous demand from large-scale and data-distributed machine learning applications. In this paper, we present a stochastic setting for optimization problems with non-smooth composite objective functions. To solve this problem, we propose a stochastic ADMM algorithm. Our algorithm applies to a more general class of convex and nonsmooth objective functions, beyond the smooth and separable least squares loss used in lasso. We also demonstrate the rates of convergence for our algorithm under various structural assumptions of the stochastic function: (O(1/\sqrt{t})) for convex functions and (O(\log t/t)) for strongly convex functions. Compared to previous literature, we establish the convergence rate of ADMM for convex problems in terms of both the objective value and the feasibility violation. A novel application named Graph-Guided SVM is proposed to demonstrate the usefulness of our algorithm.

【Keywords】:

11. Noisy Sparse Subspace Clustering.

Paper Link】 【Pages】:89-97

【Authors】: Yu-Xiang Wang ; Huan Xu

【Abstract】: This paper considers the problem of subspace clustering under noise. Specifically, we study the behavior of Sparse Subspace Clustering (SSC) when either adversarial or random noise is added to the unlabelled input data points, which are assumed to lie in a union of low-dimensional subspaces. We show that a modified version of SSC is provably effective in correctly identifying the underlying subspaces, even with noisy data. This extends theoretical guarantee of this algorithm to the practical setting and provides justification to the success of SSC in a class of real applications.

【Keywords】:

12. Parallel Markov Chain Monte Carlo for Nonparametric Mixture Models.

Paper Link】 【Pages】:98-106

【Authors】: Sinead Williamson ; Avinava Dubey ; Eric P. Xing

【Abstract】: Nonparametric mixture models based on the Dirichlet process are an elegant alternative to finite models when the number of underlying components is unknown, but inference in such models can be slow. Existing attempts to parallelize inference in such models have relied on introducing approximations, which can lead to inaccuracies in the posterior estimate. In this paper, we describe auxiliary variable representations for the Dirichlet process and the hierarchical Dirichlet process that allow us to perform MCMC using the correct equilibrium distribution, in a distributed manner. We show that our approach allows scalable inference without the deterioration in estimate quality that accompanies existing methods.

【Keywords】:

13. Risk Bounds and Learning Algorithms for the Regression Approach to Structured Output Prediction.

Paper Link】 【Pages】:107-114

【Authors】: Sébastien Giguère ; François Laviolette ; Mario Marchand ; Khadidja Sylla

【Abstract】: We provide rigorous guarantees for the regression approach to structured output prediction. We show that the quadratic regression loss is a convex surrogate of the prediction loss when the output kernel satisfies some condition with respect to the prediction loss. We provide two upper bounds of the prediction risk that depend on the empirical quadratic risk of the predictor. The minimizer of the first bound is the predictor proposed by Cortes et al. (2007) while the minimizer of the second bound is a predictor that has never been proposed so far. Both predictors are compared on practical tasks.

【Keywords】:

14. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures.

Paper Link】 【Pages】:115-123

【Authors】: James Bergstra ; Daniel Yamins ; David D. Cox

【Abstract】: Many computer vision algorithms depend on configuration settings that are typically hand-tuned in the course of evaluating the algorithm for a particular data set. While such parameter tuning is often presented as being incidental to the algorithm, correctly setting these parameter choices is frequently critical to realizing a method’s full potential. Compounding matters, these parameters often must be re-tuned when the algorithm is applied to a new problem domain, and the tuning process itself often depends on personal experience and intuition in ways that are hard to quantify or describe. Since the performance of a given technique depends on both the fundamental quality of the algorithm and the details of its tuning, it is sometimes difficult to know whether a given technique is genuinely better, or simply better tuned. In this work, we propose a meta-modeling approach to support automated hyperparameter optimization, with the goal of providing practical tools that replace hand-tuning with a reproducible and unbiased optimization process. Our approach is to expose the underlying expression graph of how a performance metric (e.g. classification accuracy on validation examples) is computed from hyperparameters that govern not only how individual processing steps are applied, but even which processing steps are included. A hyperparameter optimization algorithm transforms this graph into a program for optimizing that performance metric. Our approach yields state of the art results on three disparate computer vision problems: a face-matching verification task (LFW), a face identification task (PubFig83) and an object recognition task (CIFAR-10), using a single broad class of feed-forward vision architectures.

【Keywords】:

15. Gibbs Max-Margin Topic Models with Fast Sampling Algorithms.

Paper Link】 【Pages】:124-132

【Authors】: Jun Zhu ; Ning Chen ; Hugh Perkins ; Bo Zhang

【Abstract】: Existing max-margin supervised topic models rely on an iterative procedure to solve multiple latent SVM subproblems with additional mean-field assumptions on the desired posterior distributions. This paper presents Gibbs max-margin supervised topic models by minimizing an expected margin loss, an upper bound of the existing margin loss derived from an expected prediction rule. By introducing augmented variables, we develop simple and fast Gibbs sampling algorithms with no restricting assumptions and no need to solve SVM subproblems for both classification and regression. Empirical results demonstrate significant improvements on time efficiency. The classification performance is also significantly improved over competitors.

【Keywords】:

16. Cost-Sensitive Tree of Classifiers.

Paper Link】 【Pages】:133-141

【Authors】: Zhixiang Eddie Xu ; Matt J. Kusner ; Kilian Q. Weinberger ; Minmin Chen

【Abstract】: Recently, machine learning algorithms have successfully entered large-scale real-world industrial applications (e.g. search engines and email spam filters). Here, the CPU cost during test-time must be budgeted and accounted for. In this paper, we address the challenge of balancing test-time cost and the classifier accuracy in a principled fashion. The test-time cost of a classifier is often dominated by the computation required for feature extraction-which can vary drastically across features. We incorporate this extraction time by constructing a tree of classifiers, through which test inputs traverse along individual paths. Each path extracts different features and is optimized for a specific sub-partition of the input space. By only computing features for inputs that benefit from them the most, our cost-sensitive tree of classifiers can match the high accuracies of the current state-of-the-art at a small fraction of the computational cost.

【Keywords】:

17. Learning Hash Functions Using Column Generation.

Paper Link】 【Pages】:142-150

【Authors】: Xi Li ; Guosheng Lin ; Chunhua Shen ; Anton van den Hengel ; Anthony R. Dick

【Abstract】: Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally; and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets.

【Keywords】:

18. Combinatorial Multi-Armed Bandit: General Framework and Applications.

Paper Link】 【Pages】:151-159

【Authors】: Wei Chen ; Yajun Wang ; Yang Yuan

【Abstract】: We define a general framework for a large class of combinatorial multi-armed bandit (CMAB) problems, where simple arms with unknown istributions form super arms. In each round, a super arm is played and the outcomes of its related simple arms are observed, which helps the selection of super arms in future rounds. The reward of the super arm depends on the outcomes of played arms, and it only needs to satisfy two mild assumptions, which allow a large class of nonlinear reward instances. We assume the availability of an ((\alpha,\beta))-approximation oracle that takes the means of the distributions of arms and outputs a super arm that with probability (\beta) generates an (\alpha) fraction of the optimal expected reward. The objective of a CMAB algorithm is to minimize ((\alpha,\beta))-approximation regret, which is the difference in total expected reward between the (\alpha\beta) fraction of expected reward when always playing the optimal super arm, and the expected reward of playing super arms according to the algorithm. We provide CUCB algorithm that achieves (O(\log n)) regret, where (n) is the number of rounds played, and we further provide distribution-independent bounds for a large class of reward functions. Our regret analysis is tight in that it matches the bound for classical MAB problem up to a constant factor, and it significantly improves the regret bound in a recent paper on combinatorial bandits with linear rewards. We apply our CMAB framework to two new applications, probabilistic maximum coverage (PMC) for online advertising and social influence maximization for viral marketing, both having nonlinear reward structures.

【Keywords】:

19. Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization.

Paper Link】 【Pages】:160-168

【Authors】: Yuxin Chen ; Andreas Krause

【Abstract】: Active learning can lead to a dramatic reduction in labeling effort. However, in many practical implementations (such as crowdsourcing, surveys, high-throughput experimental design), it is preferable to query labels for batches of examples to be labelled in parallel. While several heuristics have been proposed for batch-mode active learning, little is known about their theoretical performance. We consider batch mode active learning and more general information-parallel stochastic optimization problems that exhibit adaptive submodularity, a natural diminishing returns condition. We prove that for such problems, a simple greedy strategy is competitive with the optimal batch-mode policy. In some cases, surprisingly, the use of batches incurs competitively low cost, even when compared to a fully sequential strategy. We demonstrate the effectiveness of our approach on batch-mode active learning tasks, where it outperforms the state of the art, as well as the novel problem of multi-stage influence maximization in social networks.

【Keywords】:

20. Convex formulations of radius-margin based Support Vector Machines.

Paper Link】 【Pages】:169-177

【Authors】: Huyen Do ; Alexandros Kalousis

【Abstract】: We consider Support Vector Machines (SVMs) learned together with linear transformations of the feature spaces on which they are applied. Under this scenario the radius of the smallest data enclosing sphere is no longer fixed. Therefore optimizing the SVM error bound by considering both the radius and the margin has the potential to deliver a tighter error bound. In this paper we present two novel algorithms: (R-SVM{\mu}^+)—a SVM radius-margin based feature selection algorithm, and (R-SVM^+) — a metric learning-based SVM. We derive our algorithms by exploiting a new tighter approximation of the radius and a metric learning interpretation of SVM. Both optimize directly the radius-margin error bound using linear transformations. Unlike almost all existing radius-margin based SVM algorithms which are either non-convex or combinatorial, our algorithms are standard quadratic convex optimization problems with linear or quadratic constraints. We perform a number of experiments on benchmark datasets. (R-SVM{\mu}^+) exhibits excellent feature selection performance compared to the state-of-the-art feature selection methods, such as (L_1)-norm and elastic-net based methods. (R-SVM^+) achieves a significantly better classification performance compared to SVM and its other state-of-the-art variants. From the results it is clear that the incorporation of the radius, as a means to control the data spread, in the cost function has strong beneficial effects.

【Keywords】:

21. Modelling Sparse Dynamical Systems with Compressed Predictive State Representations.

Paper Link】 【Pages】:178-186

【Authors】: William L. Hamilton ; Mahdi Milani Fard ; Joelle Pineau

【Abstract】: Efficiently learning accurate models of dynamical systems is of central importance for developing rational agents that can succeed in a wide range of challenging domains. The difficulty of this learning problem is particularly acute in settings with large observation spaces and partial observability. We present a new algorithm, called Compressed Predictive State Representation (CPSR), for learning models of high-dimensional partially observable uncontrolled dynamical systems from small sample sets. The algorithm, which extends previous work on Predictive State Representations, exploits a particular sparse structure present in many domains. This sparse structure is used to compress information during learning, allowing for an increase in both the efficiency and predictive power. The compression technique also relieves the burden of domain specific feature selection and allows for domains with extremely large discrete observation spaces to be efficiently modelled. We present empirical results showing that the algorithm is able to build accurate models more efficiently than its uncompressed counterparts, and provide theoretical results on the accuracy of the learned compressed model.

【Keywords】:

22. A Machine Learning Framework for Programming by Example.

Paper Link】 【Pages】:187-195

【Authors】: Aditya Krishna Menon ; Omer Tamuz ; Sumit Gulwani ; Butler W. Lampson ; Adam Kalai

【Abstract】: Learning programs is a timely and interesting challenge. In Programming by Example (PBE), a system attempts to infer a program from input and output examples alone, by searching for a composition of some set of base functions. We show how machine learning can be used to speed up this seemingly hopeless search problem, by learning weights that relate textual features describing the provided input-output examples to plausible sub-components of a program. This generic learning framework lets us address problems beyond the scope of earlier PBE systems. Experiments on a prototype implementation show that learning improves search and ranking on a variety of text processing tasks found on help forums.

【Keywords】:

23. Discriminatively Activated Sparselets.

Paper Link】 【Pages】:196-204

【Authors】: Ross B. Girshick ; Hyun Oh Song ; Trevor Darrell

【Abstract】: Shared representations are highly appealing due to their potential for gains in computational and statistical efficiency. Compressing a shared representation leads to greater computational savings, but at the same time can severely decrease performance on a target task. Recently, sparselets (Song et al., 2012) were introduced as a new shared intermediate representation for multiclass object detection with deformable part models (Felzenszwalb et al., 2010a), showing significant speedup factors, but with a large decrease in task performance. In this paper we describe a new training framework that learns which sparselets to activate in order to optimize a discriminative objective, leading to larger speedup factors with no decrease in task performance. We first reformulate sparselets in a general structured output prediction framework, then analyze when sparselets lead to computational efficiency gains, and lastly show experimental results on object detection and image classification tasks. Our experimental results demonstrate that discriminative activation substantially outperforms the previous reconstructive approach which, together with our structured output prediction formulation, make sparselets broadly applicable and significantly more effective.

【Keywords】:

24. The Pairwise Piecewise-Linear Embedding for Efficient Non-Linear Classification.

Paper Link】 【Pages】:205-213

【Authors】: Ofir Pele ; Ben Taskar ; Amir Globerson ; Michael Werman

【Abstract】: Linear classiffers are much faster to learn and test than non-linear ones. On the other hand, non-linear kernels offer improved performance, albeit at the increased cost of training kernel classiffers. To use non-linear mappings with efficient linear learning algorithms, explicit embeddings that approximate popular kernels have recently been proposed. However, the embedding process itself is often costly and the results are usually less accurate than kernel methods. In this work we propose a non-linear feature map that is both very efficient, but at the same time highly expressive. The method is based on discretization and interpolation of individual features values and feature pairs. The discretization allows us to model different regions of the feature space separately, while the interpolation preserves the original continuous values. Using this embedding is strictly more general than a linear model and as efficient as the second-order polynomial explicit feature map. An extensive empirical evaluation shows that our method consistently signiffcantly outperforms other methods, including a wide range of kernels. This is in contrast to other proposed embeddings that were faster than kernel methods, but with lower accuracy.

【Keywords】:

25. Fixed-Point Model For Structured Labeling.

Paper Link】 【Pages】:214-221

【Authors】: Quannan Li ; Jingdong Wang ; David P. Wipf ; Zhuowen Tu

【Abstract】: In this paper, we propose a simple but effective solution to the structured labeling problem: a fixed-point model. Recently, layered models with sequential classifiers/regressors have gained an increasing amount of interests for structural prediction. Here, we design an algorithm with a new perspective on layered models; we aim to find a fixed-point function with the structured labels being both the output and the input. Our approach alleviates the burden in learning multiple/different classifiers in different layers. We devise a training strategy for our method and provide justifications for the fixed-point function to be a contraction mapping. The learned function captures rich contextual information and is easy to train and test. On several widely used benchmark datasets, the proposed method observes significant improvement in both performance and efficiency over many state-of-the-art algorithms.

【Keywords】:

26. Connecting the Dots with Landmarks: Discriminatively Learning Domain-Invariant Features for Unsupervised Domain Adaptation.

Paper Link】 【Pages】:222-230

【Authors】: Boqing Gong ; Kristen Grauman ; Fei Sha

【Abstract】: Learning domain-invariant features is of vital importance to unsupervised domain adaptation, where classifiers trained on the source domain need to be adapted to a different target domain for which no labeled examples are available. In this paper, we propose a novel approach for learning such features. The central idea is to exploit the existence of landmarks, which are a subset of labeled data instances in the source domain that are distributed most similarly to the target domain. Our approach automatically discovers the landmarks and use them to bridge the source to the target by constructing provably easier auxiliary domain adaptation tasks. The solutions of those auxiliary tasks form the basis to compose invariant features for the original task. We show how this composition can be optimized discriminatively without requiring labels from the target domain. We validate the method on standard benchmark datasets for visual object recognition and sentiment analysis of text. Empirical results show the proposed method outperforms the state-of-the-art significantly.

【Keywords】:

27. Fast Conical Hull Algorithms for Near-separable Non-negative Matrix Factorization.

Paper Link】 【Pages】:231-239

【Authors】: Abhishek Kumar ; Vikas Sindhwani ; Prabhanjan Kambadur

【Abstract】: The separability assumption (Arora et al., 2012; Donoho & Stodden, 2003) turns non-negative matrix factorization (NMF) into a tractable problem. Recently, a new class of provably-correct NMF algorithms have emerged under this assumption. In this paper, we reformulate the separable NMF problem as that of finding the extreme rays of the conical hull of a finite set of vectors. From this geometric perspective, we derive new separable NMF algorithms that are highly scalable and empirically noise robust, and have several favorable properties in relation to existing methods. A parallel implementation of our algorithm scales excellently on shared and distributed-memory machines.

【Keywords】:

28. Principal Component Analysis on non-Gaussian Dependent Data.

Paper Link】 【Pages】:240-248

【Authors】: Fang Han ; Han Liu

【Abstract】: In this paper, we analyze the performance of a semiparametric principal component analysis named Copula Component Analysis (COCA) (Han & Liu, 2012) when the data are dependent. The semiparametric model assumes that, after unspecified marginally monotone transformations, the distributions are multivariate Gaussian. We study the scenario where the observations are drawn from non-i.i.d. processes (m-dependency or a more general phi-mixing case). We show that COCA can allow weak dependence. In particular, we provide the generalization bounds of convergence for both support recovery and parameter estimation of COCA for the dependent data. We provide explicit sufficient conditions on the degree of dependence, under which the parametric rate can be maintained. To our knowledge, this is the first work analyzing the theoretical performance of PCA for the dependent data in high dimensional settings. Our results strictly generalize the analysis in Han & Liu (2012) and the techniques we used have the separate interest for analyzing a variety of other multivariate statistical methods.

【Keywords】:

29. Learning Linear Bayesian Networks with Latent Variables.

Paper Link】 【Pages】:249-257

【Authors】: Animashree Anandkumar ; Daniel J. Hsu ; Adel Javanmard ; Sham Kakade

【Abstract】: This work considers the problem of learning linear Bayesian networks when some of the variables are unobserved. Identifiability and efficient recovery from low-order observable moments are established under a novel graphical constraint. The constraint concerns the expansion properties of the underlying directed acyclic graph (DAG) between observed and unobserved variables in the network, and it is satisfied by many natural families of DAGs that include multi-level DAGs, DAGs with effective depth one, as well as certain families of polytrees.

【Keywords】:

30. Multiple Identifications in Multi-Armed Bandits.

Paper Link】 【Pages】:258-265

【Authors】: Sébastien Bubeck ; Tengyao Wang ; Nitin Viswanathan

【Abstract】: We study the problem of identifying the top m arms in a multi-armed bandit game. Our proposed solution relies on a new algorithm based on successive rejects of the seemingly bad arms, and successive accepts of the good ones. This algorithmic contribution allows to tackle other multiple identifications settings that were previously out of reach. In particular we show that this idea of successive accepts and rejects applies to the multi-bandit best arm identification problem.

【Keywords】:

31. Learning Optimally Sparse Support Vector Machines.

Paper Link】 【Pages】:266-274

【Authors】: Andrew Cotter ; Shai Shalev-Shwartz ; Nati Srebro

【Abstract】: We show how to train SVMs with an optimal guarantee on the number of support vectors (up to constants), and with sample complexity and training runtime bounds matching the best known for kernel SVM optimization (i.e. without any additional asymptotic cost beyond standard SVM training). Our method is simple to implement and works well in practice.

【Keywords】:

32. Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks.

Paper Link】 【Pages】:275-283

【Authors】: Creighton Heaukulani ; Zoubin Ghahramani

【Abstract】: Current Bayesian models for dynamic social network data have focused on modelling the influence of evolving unobserved structure on observed social interactions. However, an understanding of how observed social relationships from the past affect future unobserved structure in the network has been neglected. In this paper, we introduce a new probabilistic model for capturing this phenomenon, which we call latent feature propagation, in social networks. We demonstrate our model’s capability for inferring such latent structure in varying types of social network datasets, and experimental studies show this structure achieves higher predictive performance on link prediction and forecasting tasks.

【Keywords】:

33. Efficient Sparse Group Feature Selection via Nonconvex Optimization.

Paper Link】 【Pages】:284-292

【Authors】: Shuo Xiang ; Xiaoshen Tong ; Jieping Ye

【Abstract】: Sparse feature selection has been demonstrated to be effective in handling high-dimensional data. While promising, most of the existing works use convex methods, which may be suboptimal in terms of the accuracy of feature selection and parameter estimation. In this paper, we expand a nonconvex paradigm to sparse group feature selection, which is motivated by applications that require identifying the underlying group structure and performing feature selection simultaneously. The main contributions of this article are twofold: (1) computationally, we introduce a nonconvex sparse group feature selection model and present an efficient optimization algorithm, of which the key step is a projection with two coupled constraints; (2) statistically, we show that the proposed model can reconstruct the oracle estimator. Therefore, consistent feature selection and parameter estimation can be achieved. Numerical results on synthetic and real-world data suggest that the proposed nonconvex method compares favorably against its competitors, thus achieving desired goal of delivering high performance.

【Keywords】:

34. Domain Adaptation for Sequence Labeling Tasks with a Probabilistic Language Adaptation Model.

Paper Link】 【Pages】:293-301

【Authors】: Min Xiao ; Yuhong Guo

【Abstract】: In this paper, we propose to address the problem of domain adaptation for sequence labeling tasks via distributed representation learning by using a log-bilinear language adaptation model. The proposed neural probabilistic language model simultaneously models two different but related data distributions in the source and target domains based on induced distributed representations, which encode both generalizable and domain-specific latent features. We then use the learned dense real-valued representation as augmenting features for natural language processing systems. We empirically evaluate the proposed learning technique on WSJ and MEDLINE domains with POS tagging systems, and on WSJ and Brown corpora with syntactic chunking and name entity recognition systems. Our primary results show that the proposed domain adaptation method outperforms a number comparison methods for cross domain sequence labeling tasks.

【Keywords】:

35. Maximum Variance Correction with Application to A* Search.

Paper Link】 【Pages】:302-310

【Authors】: Wenlin Chen ; Kilian Q. Weinberger ; Yixin Chen

【Abstract】: In this paper we introduce Maximum Variance Correction (MVC), which finds large-scale feasible solutions to Maximum Variance Unfolding (MVU) by post-processing embeddings from any manifold learning algorithm. It increases the scale of MVU embeddings by several orders of magnitude and is naturally parallel. This unprecedented scalability opens up new avenues of applications for manifold learning, in particular the use of MVU embeddings as effective heuristics to speed-up A search (Rayner et al. 2011). We demonstrate that MVC embeddings lead to un-matched reductions in search time across several non-trivial A benchmark search problems and bridge the gap between the manifold learning literature and one of its most promising high impact applications.

【Keywords】:

36. Adaptive Sparsity in Gaussian Graphical Models.

Paper Link】 【Pages】:311-319

【Authors】: Eleanor Wong ; Suyash P. Awate ; P. Thomas Fletcher

【Abstract】: An effective approach to structure learning and parameter estimation for Gaussian graphical models is to impose a sparsity prior, such as a Laplace prior, on the entries of the precision matrix. Such an approach involves a hyperparameter that must be tuned to control the amount of sparsity. In this paper, we introduce a parameter-free method for estimating a precision matrix with sparsity that adapts to the data automatically. We achieve this by formulating a hierarchical Bayesian model of the precision matrix with a non-informative Jeffreys’ hyperprior. We also naturally enforce the symmetry and positive-definiteness constraints on the precision matrix by parameterizing it with the Cholesky decomposition. Experiments on simulated and real (cell signaling) data demonstrate that the proposed approach not only automatically adapts the sparsity of the model, but it also results in improved estimates of the precision matrix compared to the Laplace prior model with sparsity parameter chosen by cross-validation.

【Keywords】:

37. Average Reward Optimization Objective In Partially Observable Domains.

Paper Link】 【Pages】:320-328

【Authors】: Yuri Grinberg ; Doina Precup

【Abstract】: We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs). The key to average-reward computation is to have a well-defined stationary behavior of a system, so the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in policy parameters, average-reward control through policy search also becomes a possibility. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of policy parameters. Based on this result, we define a related reward process particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average reward-based policy search methods can be effective when the dimension of the system is small, even when the system representation in the POMDP framework requires many hidden states. We provide illustrative examples of this type.

【Keywords】:

38. Feature Selection in High-Dimensional Classification.

Paper Link】 【Pages】:329-337

【Authors】: Mladen Kolar ; Han Liu

【Abstract】: High-dimensional discriminant analysis is of fundamental importance in multivariate statistics. Existing theoretical results sharply characterize different procedures, providing sharp convergence results for the classification risk, as well as the l2 convergence results to the discriminative rule. However, sharp theoretical results for the problem of variable selection have not been established, even though model interpretation is of importance in many scientific domains. In this paper, we bridge this gap by providing sharp sufficient conditions for consistent variable selection using the ROAD estimator (Fan et al., 2010). Our results provide novel theoretical insights for the ROAD estimator. Sufficient conditions are complemented by the necessary information theoretic limits on variable selection in high-dimensional discriminant analysis. This complementary result also establishes optimality of the ROAD estimator for a certain family of problems.

【Keywords】:

39. Human Boosting.

Paper Link】 【Pages】:338-346

【Authors】: Harsh H. Pareek ; Pradeep Ravikumar

【Abstract】: Humans may be exceptional learners but they have biological limitations and moreover, inductive biases similar to machine learning algorithms. This puts limits on human learning ability and on the kinds of learning tasks humans can easily handle. In this paper, we consider the problem of “boosting” human learners to extend the learning ability of human learners and achieve improved performance on tasks which individual humans find difficult. We consider classification (category learning) tasks, propose a boosting algorithm for human learners and give theoretical justifications. We conduct experiments using Amazon’s Mechanical Turk on two synthetic datasets – a crosshair task with a nonlinear decision boundary and a gabor patch task with a linear boundary but which is inaccessible to human learners – and one real world dataset – the Opinion Spam detection task introduced in (Ott et al). Our results show that boosting human learners produces gains in accuracy and can overcome some fundamental limitations of human learners.

【Keywords】:

40. Efficient Dimensionality Reduction for Canonical Correlation Analysis.

Paper Link】 【Pages】:347-355

【Authors】: Haim Avron ; Christos Boutsidis ; Sivan Toledo ; Anastasios Zouzias

【Abstract】: We present a fast algorithm for approximate Canonical Correlation Analysis (CCA). Given a pair of tall-and-thin matrices, the proposed algorithm first employs a randomized dimensionality reduction transform to reduce the size of the input matrices, and then applies any standard CCA algorithm to the new pair of matrices. The algorithm computes an approximate CCA to the original pair of matrices with provable guarantees, while requiring asymptotically less operations than the state-of-the-art exact algorithms.

【Keywords】:

41. Parsing epileptic events using a Markov switching process model for correlated time series.

Paper Link】 【Pages】:356-364

【Authors】: Drausin Wulsin ; Emily B. Fox ; Brian Litt

【Abstract】: Patients with epilepsy can manifest short, sub-clinical epileptic “bursts” in addition to full-blown clinical seizures. We believe the relationship between these two classes of events—something not previously studied quantitatively—could yield important insights into the nature and intrinsic dynamics of seizures. A goal of our work is to parse these complex epileptic events into distinct dynamic regimes. A challenge posed by the intracranial EEG (iEEG) data we study is the fact that the number and placement of electrodes can vary between patients. We develop a Bayesian nonparametric Markov switching process that allows for (i) shared dynamic regimes between a variable numbers of channels, (ii) asynchronous regime-switching, and (iii) an unknown dictionary of dynamic regimes. We encode a sparse and changing set of dependencies between the channels using a Markov-switching Gaussian graphical model for the innovations process driving the channel dynamics. We demonstrate the importance of this model in parsing and out-of-sample predictions of iEEG data. We show that our model produces intuitive state assignments that can help automate clinical analysis of seizures and enable the comparison of sub-clinical bursts and full clinical seizures.

【Keywords】:

42. Optimal rates for stochastic convex optimization under Tsybakov noise condition.

Paper Link】 【Pages】:365-373

【Authors】: Aaditya Ramdas ; Aarti Singh

【Abstract】: We focus on the problem of minimizing a convex function (f) over a convex set (S) given (T) queries to a stochastic first order oracle. We argue that the complexity of convex minimization is only determined by the rate of growth of the function around its minimum (x^_{f,S}), as quantified by a Tsybakov-like noise condition. Specifically, we prove that if (f) grows at least as fast as (|x-x^{f,S}|^\kappa) around its minimum, for some (\kappa > 1), then the optimal rate of learning (f(x^*{f,S})) is (\Theta(T^{-\frac{\kappa}{2\kappa-2}})). The classic rate (\Theta(1/\sqrt T)) for convex functions and (\Theta(1/T)) for strongly convex functions are special cases of our result for (\kappa \rightarrow \infty) and (\kappa=2), and even faster rates are attained for (1 < \kappa < 2). We also derive tight bounds for the complexity of learning (x_{f,S}^*), where the optimal rate is (\Theta(T^{-\frac{1}{2\kappa-2}})). Interestingly, these precise rates also characterize the complexity of active learning and our results further strengthen the connections between the fields of active learning and convex optimization, both of which rely on feedback-driven queries.

【Keywords】:

43. A Randomized Mirror Descent Algorithm for Large Scale Multiple Kernel Learning.

Paper Link】 【Pages】:374-382

【Authors】: Arash Afkanpour ; András György ; Csaba Szepesvári ; Michael Bowling

【Abstract】: We consider the problem of simultaneously learning to linearly combine a very large number of kernels and learn a good predictor based on the learnt kernel. When the number of kernels d to be combined is very large, multiple kernel learning methods whose computational cost scales linearly in d are intractable. We propose a randomized version of the mirror descent algorithm to overcome this issue, under the objective of minimizing the group p-norm penalized empirical risk. The key to achieve the required exponential speed-up is the computationally efficient construction of low-variance estimates of the gradient. We propose importance sampling based estimates, and find that the ideal distribution samples a coordinate with a probability proportional to the magnitude of the corresponding gradient. We show that in the case of learning the coefficients of a polynomial kernel, the combinatorial structure of the base kernels to be combined allows sampling from this distribution in (O(\log(d))) time, making the total computational cost of the method to achieve an epsilon-optimal solution to be (O(\log(d)/epsilon^2)), thereby allowing our method to operate for very large values of d. Experiments with simulated and real data confirm that the new algorithm is computationally more efficient than its state-of-the-art alternatives.

【Keywords】:

44. Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery.

Paper Link】 【Pages】:383-391

【Authors】: Yudong Chen ; Constantine Caramanis

【Abstract】: Many models for sparse regression typically assume that the covariates are known completely, and without noise. Particularly in high-dimensional applications, this is often not the case. Worse yet, even estimating statistics of the noise (the noise covariance) can be a central challenge. In this paper we develop a simple variant of orthogonal matching pursuit (OMP) for precisely this setting. We show that without knowledge of the noise covariance, our algorithm recovers the support, and we provide matching lower bounds that show that our algorithm performs at the minimax optimal rate. While simple, this is the first algorithm that (provably) recovers support in a noise-distribution-oblivious manner. When knowledge of the noise-covariance is available, our algorithm matches the best-known (\ell^2)-recovery bounds available. We show that these too are min-max optimal. Along the way, we also obtain improved performance guarantees for OMP for the standard sparse regression problem with Gaussian noise.

【Keywords】:

45. Dual Averaging and Proximal Gradient Descent for Online Alternating Direction Multiplier Method.

Paper Link】 【Pages】:392-400

【Authors】: Taiji Suzuki

【Abstract】: We develop new stochastic optimization methods that are applicable to a wide range of structured regularizations. Basically our methods are combinations of basic stochastic optimization techniques and Alternating Direction Multiplier Method (ADMM). ADMM is a general framework for optimizing a composite function, and has a wide range of applications. We propose two types of online variants of ADMM, which correspond to online proximal gradient descent and regularized dual averaging respectively. The proposed algorithms are computationally efficient and easy to implement. Our methods yield (O(1/\sqrt{T})) convergence of the expected risk. Moreover, the online proximal gradient descent type method yields (O(\log(T)/T)) convergence for a strongly convex loss. Numerical experiments show effectiveness of our methods in learning tasks with structured sparsity such as overlapped group lasso.

【Keywords】:

46. A New Frontier of Kernel Design for Structured Data.

Paper Link】 【Pages】:401-409

【Authors】: Kilho Shin

【Abstract】: Many kernels for discretely structured data in the literature are designed within the framework of the convolution kernel and its generalization, the mapping kernel. The two most important advantages to use this framework is an easy-to-check criteria of positive definiteness and efficient computation based on the dynamic programming methodology of the resulting kernels. On the other hand, the recent theory of partitionable kernels reveals that the known kernels only take advantage of a very small portion of the potential of the framework. In fact, we have good opportunities to find novel and important kernels in the unexplored area. In this paper, we shed light on a novel important class of kernels within the framework: We give a mathematical characterization of the class, show a parametric method to optimize kernels of the class to specific problems, based on this characterization, and present some experimental results, which show the new kernels are promising in both accuracy and efficiency.

【Keywords】:

47. Learning with Marginalized Corrupted Features.

Paper Link】 【Pages】:410-418

【Authors】: Laurens van der Maaten ; Minmin Chen ; Stephen Tyree ; Kilian Q. Weinberger

【Abstract】: The goal of machine learning is to develop predictors that generalize well to test data. Ideally, this is achieved by training on very large (infinite) training data sets that capture all variations in the data distribution. In the case of finite training data, an effective solution is to extend the training set with artificially created examples – which, however, is also computationally costly. We propose to corrupt training examples with noise from known distributions within the exponential family and present a novel learning algorithm, called marginalized corrupted features (MCF), that trains robust predictors by minimizing the expected value of the loss function under the corrupting distribution – essentially learning with infinitely many (corrupted) training examples. We show empirically on a variety of data sets that MCF classifiers can be trained efficiently, may generalize substantially better to test data, and are more robust to feature deletion at test time.

【Keywords】:

48. Approximation properties of DBNs with binary hidden units and real-valued visible units.

Paper Link】 【Pages】:419-426

【Authors】: Oswin Krause ; Asja Fischer ; Tobias Glasmachers ; Christian Igel

【Abstract】: Deep belief networks (DBNs) can approximate any distribution over fixed-length binary vectors. However, DBNs are frequently applied to model real-valued data, and so far little is known about their representational power in this case. We analyze the approximation properties of DBNs with two layers of binary hidden units and visible units with conditional distributions from the exponential family. It is shown that these DBNs can, under mild assumptions, model any additive mixture of distributions from the exponential family with independent variables. An arbitrarily good approximation in terms of Kullback-Leibler divergence of an m-dimensional mixture distribution with n components can be achieved by a DBN with m visible variables and n and n+1 hidden variables in the first and second hidden layer, respectively. Furthermore, relevant infinite mixtures can be approximated arbitrarily well by a DBN with a finite number of neurons. This includes the important special case of an infinite mixture of Gaussian distributions with fixed variance restricted to a compact domain, which in turn can approximate any strictly positive density over this domain.

【Keywords】:

49. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization.

Paper Link】 【Pages】:427-435

【Authors】: Martin Jaggi

【Abstract】: We provide stronger and more general primal-dual convergence results for Frank-Wolfe-type algorithms (a.k.a. conditional gradient) for constrained convex optimization, enabled by a simple framework of duality gap certificates. Our analysis also holds if the linear subproblems are only solved approximately (as well as if the gradients are inexact), and is proven to be worst-case optimal in the sparsity of the obtained solutions. On the application side, this allows us to unify a large variety of existing sparse greedy methods, in particular for optimization over convex hulls of an atomic set, even if those sets can only be approximated, including sparse (or structured sparse) vectors or matrices, low-rank matrices, permutation matrices, or max-norm bounded matrices. We present a new general framework for convex optimization over matrix factorizations, where every Frank-Wolfe iteration will consist of a low-rank update, and discuss the broad application areas of this approach.

【Keywords】:

50. General Functional Matrix Factorization Using Gradient Boosting.

Paper Link】 【Pages】:436-444

【Authors】: Tianqi Chen ; Hang Li ; Qiang Yang ; Yong Yu

【Abstract】: Matrix factorization is among the most successful techniques for collaborative filtering. One challenge of collaborative filtering is how to utilize available auxiliary information to improve prediction accuracy. In this paper, we study the problem of utilizing auxiliary information as features of factorization and propose formalizing the problem as general functional matrix factorization, whose model includes conventional matrix factorization models as its special cases. Moreover, we propose a gradient boosting based algorithm to efficiently solve the optimization problem. Finally, we give two specific algorithms for efficient feature function construction for two specific tasks. Our method can construct more suitable feature functions by searching in an infinite functional space based on training data and thus can yield better prediction accuracy. The experimental results demonstrate that the proposed method outperforms the baseline methods on three real-world datasets.

【Keywords】:

51. Iterative Learning and Denoising in Convolutional Neural Associative Memories.

Paper Link】 【Pages】:445-453

【Authors】: Amin Karbasi ; Amir Hesam Salavati ; Amin Shokrollahi

【Abstract】: The task of a neural associative memory is to retrieve a set of previously memorized patterns from their noisy versions by using a network of neurons. Hence, an ideal network should be able to 1) gradually learn a set of patterns, 2) retrieve the correct pattern from noisy queries and 3) maximize the number of memorized patterns while maintaining the reliability in responding to queries. We show that by considering the inherent redundancy in the memorized patterns, one can obtain all the mentioned properties at once. This is in sharp contrast with the previous work that could only improve one or two aspects at the expense of the third. More specifically, we devise an iterative algorithm that learns the redundancy among the patterns. The resulting network has a retrieval capacity that is exponential in the size of the network. Lastly, by considering the local structures of the network, the asymptotic error correction performance can be made linear in the size of the network.

【Keywords】:

52. Scaling Multidimensional Gaussian Processes using Projected Additive Approximations.

Paper Link】 【Pages】:454-461

【Authors】: Elad Gilboa ; Yunus Saatçi ; John P. Cunningham ; Elad Gilboa

【Abstract】: Exact Gaussian Process (GP) regression has (O(N^3)) runtime for data size N, making it intractable for large N. Advances in GP scaling have not been extended to the multidimensional input setting, despite the preponderance of multidimensional applications. This paper introduces and tests a novel method of projected additive approximation to multidimensional GPs. We thoroughly illustrate the power of this method on several datasets, achieving close performance to the naive Full GP at orders of magnitude less cost.

【Keywords】:

53. Active Learning for Multi-Objective Optimization.

Paper Link】 【Pages】:462-470

【Authors】: Marcela Zuluaga ; Guillaume Sergent ; Andreas Krause ; Markus Püschel

【Abstract】: In many fields one encounters the challenge of identifying, out of a pool of possible designs, those that simultaneously optimize multiple objectives. This means that usually there is not one optimal design but an entire set of Pareto-optimal ones with optimal tradeoffs in the objectives. In many applications, evaluating one design is expensive; thus, an exhaustive search for the Pareto-optimal set is unfeasible. To address this challenge, we propose the Pareto Active Learning (PAL) algorithm, which intelligently samples the design space to predict the Pareto-optimal set. Key features of PAL include (1) modeling the objectives as samples from a Gaussian process distribution to capture structure and accommodate noisy evaluation; (2) a method to carefully choose the next design to evaluate to maximize progress; and (3) the ability to control prediction accuracy and sampling cost. We provide theoretical bounds on PAL’s sampling cost required to achieve a desired accuracy. Further, we show an experimental evaluation on three real-world data sets. The results show PAL’s effectiveness; in particular it improves significantly over a state-of-the-art evolutionary algorithm, saving in many cases about 33%.

【Keywords】:

54. A Generalized Kernel Approach to Structured Output Learning.

Paper Link】 【Pages】:471-479

【Authors】: Hachem Kadri ; Mohammad Ghavamzadeh ; Philippe Preux

【Abstract】: We study the problem of structured output learning from a regression perspective. We first provide a general formulation of the kernel dependency estimation (KDE) approach to this problem using operator-valued kernels. Our formulation overcomes the two main limitations of the original KDE approach, namely the decoupling between outputs in the image space and the inability to use a joint feature space. We then propose a covariance-based operator-valued kernel that allows us to take into account the structure of the kernel feature space. This kernel operates on the output space and only encodes the interactions between the outputs without any reference to the input space. To address this issue, we introduce a variant of our KDE method based on the conditional covariance operator that in addition to the correlation between the outputs takes into account the effects of the input variables. Finally, we evaluate the performance of our KDE approach using both covariance and conditional covariance kernels on three structured output problems, and compare it to the state-of-the art kernel-based structured output regression methods.

【Keywords】:

55. Efficient Active Learning of Halfspaces: an Aggressive Approach.

Paper Link】 【Pages】:480-488

【Authors】: Alon Gonen ; Sivan Sabato ; Shai Shalev-Shwartz

【Abstract】: We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings.

【Keywords】:

56. Enhanced statistical rankings via targeted data collection.

Paper Link】 【Pages】:489-497

【Authors】: Braxton Osting ; Christoph Brune ; Stanley Osher

【Abstract】: Given a graph where vertices represent alternatives and pairwise comparison data, (y{ij}), is given on the edges, the statistical ranking problem is to find a potential function, defined on the vertices, such that the gradient of the potential function agrees with pairwise comparisons. We study the dependence of the statistical ranking problem on the available pairwise data, i.e., pairs (i,j) for which the pairwise comparison data (y{ij}) is known, and propose a framework to identify data which, when augmented with the current dataset, maximally increases the Fisher information of the ranking. Under certain assumptions, the data collection problem decouples, reducing to a problem of finding an edge set on the graph (with a fixed number of edges) such that the second eigenvalue of the graph Laplacian is maximal. This reduction of the data collection problem to a spectral graph-theoretic question is one of the primary contributions of this work. As an application, we study the Yahoo! Movie user rating dataset and demonstrate that the addition of a small number of well-chosen pairwise comparisons can significantly increase the Fisher informativeness of the ranking.

【Keywords】:

57. Online Feature Selection for Model-based Reinforcement Learning.

Paper Link】 【Pages】:498-506

【Authors】: Trung Thanh Nguyen ; Zhuoru Li ; Tomi Silander ; Tze-Yun Leong

【Abstract】: We propose a new framework for learning the world dynamics of feature-rich environments in model-based reinforcement learning. The main idea is formalized as a new, factored state-transition representation that supports efficient online-learning of the relevant features. We construct the transition models through predicting how the actions change the world. We introduce an online sparse coding learning technique for feature selection in high-dimensional spaces. We derive theoretical guarantees for our framework and empirically demonstrate its practicality in both simulated and real robotics domains.

【Keywords】:

58. ELLA: An Efficient Lifelong Learning Algorithm.

Paper Link】 【Pages】:507-515

【Authors】: Paul Ruvolo ; Eric Eaton

【Abstract】: The problem of learning multiple consecutive tasks, known as lifelong learning, is of great importance to the creation of intelligent, general-purpose, and flexible machines. In this paper, we develop a method for online multi-task learning in the lifelong learning setting. The proposed Efficient Lifelong Learning Algorithm (ELLA) maintains a sparsely shared basis for all task models, transfers knowledge from the basis to learn each new task, and refines the basis over time to maximize performance across all tasks. We show that ELLA has strong connections to both online dictionary learning for sparse coding and state-of-the-art batch multi-task learning methods, and provide robust theoretical performance guarantees. We show empirically that ELLA yields nearly identical performance to batch multi-task learning while learning tasks sequentially in three orders of magnitude (over 1,000x) less time.

【Keywords】:

59. A Structural SVM Based Approach for Optimizing Partial AUC.

Paper Link】 【Pages】:516-524

【Authors】: Harikrishna Narasimhan ; Shivani Agarwal

【Abstract】: The area under the ROC curve (AUC) is a widely used performance measure in machine learning. Increasingly, however, in several applications, ranging from ranking and biometric screening to medical diagnosis, performance is measured not in terms of the full area under the ROC curve, but instead, in terms of the partial area under the ROC curve between two specified false positive rates. In this paper, we develop a structural SVM framework for directly optimizing the partial AUC between any two false positive rates. Our approach makes use of a cutting plane solver along the lines of the structural SVM based approach for optimizing the full AUC developed by Joachims (2005). Unlike the full AUC, where the combinatorial optimization problem needed to find the most violated constraint in the cutting plane solver can be decomposed easily to yield an efficient algorithm, the corresponding optimization problem in the case of partial AUC is harder to decompose. One of our key technical contributions is an efficient algorithm for solving this combinatorial optimization problem that has the same computational complexity as Joachims’ algorithm for optimizing the usual AUC. This allows us to efficiently optimize the partial AUC in any desired false positive range. We demonstrate the approach on a variety of real-world tasks.

【Keywords】:

60. Convex Relaxations for Learning Bounded-Treewidth Decomposable Graphs.

Paper Link】 【Pages】:525-533

【Authors】: K. S. Sesh Kumar ; Francis R. Bach

【Abstract】: We consider the problem of learning the structure of undirected graphical models with bounded treewidth, within the maximum likelihood framework. This is an NP-hard problem and most approaches consider local search techniques. In this paper, we pose it as a combinatorial optimization problem, which is then relaxed to a convex optimization problem that involves searching over the forest and hyperforest polytopes with special structures. A supergradient method is used to solve the dual problem, with a run-time complexity of (O(k^3 n^{k+2} \log n)) for each iteration, where (n) is the number of variables and (k) is a bound on the treewidth. We compare our approach to state-of-the-art methods on synthetic datasets and classical benchmarks, showing the gains of the novel convex approach.

【Keywords】:

61. Adaptive Task Assignment for Crowdsourced Classification.

Paper Link】 【Pages】:534-542

【Authors】: Chien-Ju Ho ; Shahin Jabbari ; Jennifer Wortman Vaughan

【Abstract】: Crowdsourcing markets have gained popularity as a tool for inexpensively collecting data from diverse populations of workers. Classification tasks, in which workers provide labels (such as “offensive” or “not offensive”) for instances (such as websites), are among the most common tasks posted, but due to a mix of human error and the overwhelming prevalence of spam, the labels collected are often noisy. This problem is typically addressed by collecting labels for each instance from multiple workers and combining them in a clever way. However, the question of how to choose which tasks to assign to each worker is often overlooked. We investigate the problem of task assignment and label inference for heterogeneous classification tasks. By applying online primal-dual techniques, we derive a provably near-optimal adaptive assignment algorithm. We show that adaptively assigning workers to tasks can lead to more accurate predictions at a lower cost when the available workers are diverse.

【Keywords】:

62. Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning.

Paper Link】 【Pages】:543-551

【Authors】: Odalric-Ambrym Maillard ; Phuong Nguyen ; Ronald Ortner ; Daniil Ryabko

【Abstract】: We consider an agent interacting with an environment in a single stream of actions, observations, and rewards, with no reset. This process is not assumed to be a Markov Decision Process (MDP). Rather, the agent has several representations (mapping histories of past interactions to a discrete state space) of the environment with unknown dynamics, only some of which result in an MDP. The goal is to minimize the average regret criterion against an agent who knows an MDP representation giving the highest optimal reward, and acts optimally in it. Recent regret bounds for this setting are of order (O(T^{2/3})) with an additive term constant yet exponential in some characteristics of the optimal MDP. We propose an algorithm whose regret after T time steps is (O(\sqrt{T})), with all constants reasonably small. This is optimal in T since (O(\sqrt{T})) is the optimal regret in the setting of learning in a (single discrete) MDP.

【Keywords】:

63. Better Mixing via Deep Representations.

Paper Link】 【Pages】:552-560

【Authors】: Yoshua Bengio ; Grégoire Mesnil ; Yann Dauphin ; Salah Rifai

【Abstract】: It has been hypothesized, and supported with experimental evidence, that deeper representations, when well trained, tend to do a better job at disentangling the underlying factors of variation. We study the following related conjecture: better representations, in the sense of better disentangling, can be exploited to produce Markov chains that mix faster between modes. Consequently, mixing between modes would be more efficient at higher levels of representation. To better understand this, we propose a secondary conjecture: the higher-level samples fill more uniformly the space they occupy and the high-density manifolds tend to unfold when represented at higher levels. The paper discusses these hypotheses and tests them experimentally through visualization and measurements of mixing between modes and interpolating between samples.

【Keywords】:

64. Online Latent Dirichlet Allocation with Infinite Vocabulary.

Paper Link】 【Pages】:561-569

【Authors】: Ke Zhai ; Jordan L. Boyd-Graber

【Abstract】: Topic models based on latent Dirichlet allocation (LDA) assume a predefined vocabulary a priori. This is reasonable in batch settings, but it is not reasonable when data are revealed over time, as is the case with streaming / online algorithms. To address this lacuna, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings rather than from a finite Dirichlet. We develop inference using online variational inference and because we only can consider a finite number of words for each truncated topic propose heuristics to dynamically organize, expand, and contract the set of words we consider in our vocabulary truncation. We show our model can successfully incorporate new words as it encounters new terms and that it performs better than online LDA in evaluations of topic quality and classification performance.

【Keywords】:

65. Characterizing the Representer Theorem.

Paper Link】 【Pages】:570-578

【Authors】: Yaoliang Yu ; Hao Cheng ; Dale Schuurmans ; Csaba Szepesvári

【Abstract】: The representer theorem assures that kernel methods retain optimality under penalized empirical risk minimization. While a sufficient condition on the form of the regularizer guaranteeing the representer theorem has been known since the initial development of kernel methods, necessary conditions have only been investigated recently. In this paper we completely characterize the necessary and sufficient conditions on the regularizer that ensure the representer theorem holds. The results are surprisingly simple yet broaden the conditions where the representer theorem is known to hold. Extension to the matrix domain is also addressed.

【Keywords】:

66. Dynamical Models and tracking regret in online convex programming.

Paper Link】 【Pages】:579-587

【Authors】: Eric C. Hall ; Rebecca Willett

【Abstract】: This paper describes a new online convex optimization method which incorporates a family of candidate dynamical models and establishes novel tracking regret bounds that scale with comparator’s deviation from the best dynamical model in this family. Previous online optimization methods are designed to have a total accumulated loss comparable to that of the best comparator sequence, and existing tracking or shifting regret bounds scale with the overall variation of the comparator sequence. In many practical scenarios, however, the environment is nonstationary and comparator sequences with small variation are quite weak, resulting in large losses. The proposed dynamic mirror descent method, in contrast, can yield low regret relative to highly variable comparator sequences by both tracking the best dynamical model and forming predictions based on that model. This concept is demonstrated empirically in the context of sequential compressive observations of a dynamic scene and tracking a dynamic social network.

【Keywords】:

67. Large-Scale Bandit Problems and KWIK Learning.

Paper Link】 【Pages】:588-596

【Authors】: Jacob Abernethy ; Kareem Amin ; Michael Kearns ; Moez Draief

【Abstract】: We show that parametric multi-armed bandit (MAB) problems with large state and action spaces can be algorithmically reduced to the supervised learning model known as Knows What It Knows or KWIK learning. We give matching impossibility results showing that the KWIK learnability requirement cannot be replaced by weaker supervised learning assumptions. We provide such results in both the standard parametric MAB setting, as well as for a new model in which the action space is finite but growing with time.

【Keywords】:

68. Vanishing Component Analysis.

Paper Link】 【Pages】:597-605

【Authors】: Roi Livni ; David Lehavi ; Sagi Schein ; Hila Nachlieli ; Shai Shalev-Shwartz ; Amir Globerson

【Abstract】: The vanishing ideal of a set of n points S, is the set of all polynomials that attain the value of zero on all the points in S. Such ideals can be compactly represented using a small set of polynomials known as generators of the ideal. Here we describe and analyze an efficient procedure that constructs a set of generators of a vanishing ideal. Our procedure is numerically stable, and can be used to find approximately vanishing polynomials. The resulting polynomials capture nonlinear structure in data, and can for example be used within supervised learning. Empirical comparison with kernel methods show that our method constructs more compact classifiers with comparable accuracy.

【Keywords】:

69. Learning an Internal Dynamics Model from Control Demonstration.

Paper Link】 【Pages】:606-614

【Authors】: Matthew Golub ; Steven Chase ; Byron Yu

【Abstract】: Much work in optimal control and inverse control has assumed that the controller has perfect knowledge of plant dynamics. However, if the controller is a human or animal subject, the subject’s internal dynamics model may differ from the true plant dynamics. Here, we consider the problem of learning the subject’s internal model from demonstrations of control and knowledge of task goals. Due to sensory feedback delay, the subject uses an internal model to generate an internal prediction of the current plant state, which may differ from the actual plant state. We develop a probabilistic framework and exact EM algorithm to jointly estimate the internal model, internal state trajectories, and feedback delay. We applied this framework to demonstrations by a nonhuman primate of brain-machine interface (BMI) control. We discovered that the subject’s internal model deviated from the true BMI plant dynamics and provided significantly better explanation of the recorded neural control signals than did the true plant dynamics.

【Keywords】:

70. Robust Structural Metric Learning.

Paper Link】 【Pages】:615-623

【Authors】: Daryl Lim ; Gert R. G. Lanckriet ; Brian McFee

【Abstract】: Metric learning algorithms produce a linear transformation of data which is optimized for a prediction task, such as nearest-neighbor classification or ranking. However, when the input data contains a large portion of non-informative features, existing methods fail to identify the relevant features, and performance degrades accordingly. In this paper, we present an efficient and robust structural metric learning algorithm which enforces group sparsity on the learned transformation, while optimizing for structured ranking output prediction. Experiments on synthetic and real datasets demonstrate that the proposed method outperforms previous methods in both high- and low-noise settings.

【Keywords】:

71. Constrained fractional set programs and their application in local clustering and community detection.

Paper Link】 【Pages】:624-632

【Authors】: Thomas Bühler ; Syama Sundar Rangapuram ; Simon Setzer ; Matthias Hein

【Abstract】: The (constrained) minimization of a ratio of set functions is a problem frequently occurring in clustering and community detection. As these optimization problems are typically NP-hard, one uses convex or spectral relaxations in practice. While these relaxations can be solved globally optimally, they are often too loose and thus lead to results far away from the optimum. In this paper we show that every constrained minimization problem of a ratio of non-negative set functions allows a tight relaxation into an unconstrained continuous optimization problem. This result leads to a flexible framework for solving constrained problems in network analysis. While a globally optimal solution for the resulting non-convex problem cannot be guaranteed, we outperform the loose convex or spectral relaxations by a large margin on constrained local clustering problems.

【Keywords】:

72. Efficient Semi-supervised and Active Learning of Disjunctions.

Paper Link】 【Pages】:633-641

【Authors】: Nina Balcan ; Christopher Berlind ; Steven Ehrlich ; Yingyu Liang

【Abstract】: We provide efficient algorithms for learning disjunctions in the semi-supervised setting under a natural regularity assumption introduced by (Balcan & Blum, 2005). We prove bounds on the sample complexity of our algorithms under a mild restriction on the data distribution. We also give an active learning algorithm with improved sample complexity and extend all our algorithms to the random classification noise setting.

【Keywords】:

73. Convex Adversarial Collective Classification.

Paper Link】 【Pages】:642-650

【Authors】: MohamadAli Torkamani ; Daniel Lowd

【Abstract】: In this paper, we present a novel method for robustly performing collective classification in the presence of a malicious adversary that can modify up to a fixed number of binary-valued attributes. Our method is formulated as a convex quadratic program that guarantees optimal weights against a worst-case adversary in polynomial time. In addition to increased robustness against active adversaries, this kind of adversarial regularization can also lead to improved generalization even when no adversary is present. In experiments on real and simulated data, our method consistently outperforms both non-adversarial and non-relational baselines.

【Keywords】:

74. Rounding Methods for Discrete Linear Classification.

Paper Link】 【Pages】:651-659

【Authors】: Yann Chevaleyre ; Frédéric Koriche ; Jean-Daniel Zucker

【Abstract】: Learning discrete linear functions is a notoriously difficult challenge. In this paper, the learning task is cast as combinatorial optimization problem: given a set of positive and negative feature vectors in the Euclidean space, the goal is to find a discrete linear function that minimizes the cumulative hinge loss of this training set. Since this problem is NP-hard, we propose two simple rounding algorithms that discretize the fractional solution of the problem. Generalization bounds are derived for two important classes of binary-weighted linear functions, by establishing the Rademacher complexity of these classes and proving approximation bounds for rounding methods. These methods are compared on both synthetic and real-world data.

【Keywords】:

Cycle 2 Papers 42

75. Mixture of Mutually Exciting Processes for Viral Diffusion.

Paper Link】 【Pages】:1-9

【Authors】: Shuang-Hong Yang ; Hongyuan Zha

【Abstract】: Diffusion network inference and meme tracking have been two key challenges in viral diffusion. This paper shows that these two tasks can be addressed simultaneously with a probabilistic model involving a mixture of mutually exciting point processes. A fast learning algorithms is developed based on mean-field variational inference with budgeted diffusion bandwidth. The model is demonstrated with applications to the diffusion of viral texts in (1) online social networks (e.g., Twitter) and (2) the blogosphere on the Web.

【Keywords】:

76. Gaussian Process Vine Copulas for Multivariate Dependence.

Paper Link】 【Pages】:10-18

【Authors】: David Lopez-Paz ; José Miguel Hernández-Lobato ; Zoubin Ghahramani

【Abstract】: Copulas allow to learn marginal distributions separately from the multivariate dependence structure (copula) that links them together into a density function. Vine factorizations ease the learning of high-dimensional copulas by constructing a hierarchy of conditional bivariate copulas. However, to simplify inference, it is common to assume that each of these conditional bivariate copulas is independent from its conditioning variables. In this paper, we relax this assumption by discovering the latent functions that specify the shape of a conditional copula given its conditioning variables. We learn these functions by following a Bayesian approach based on sparse Gaussian processes with expectation propagation for scalable, approximate inference. Experiments on real-world datasets show that, when modeling all conditional dependencies, we obtain better estimates of the underlying copula of the data.

【Keywords】:

77. Stochastic Simultaneous Optimistic Optimization.

Paper Link】 【Pages】:19-27

【Authors】: Michal Valko ; Alexandra Carpentier ; Rémi Munos

【Abstract】: We study the problem of global maximization of a function f given a finite number of evaluations perturbed by noise. We consider a very weak assumption on the function, namely that it is locally smooth (in some precise sense) with respect to some semi-metric, around one of its global maxima. Compared to previous works on bandits in general spaces (Kleinberg et al., 2008; Bubeck et al., 2011a) our algorithm does not require the knowledge of this semi-metric. Our algorithm, StoSOO, follows an optimistic strategy to iteratively construct upper confidence bounds over the hierarchical partitions of the function domain to decide which point to sample next. A finite-time analysis of StoSOO shows that it performs almost as well as the best specifically-tuned algorithms even though the local smoothness of the function is not known.

【Keywords】:

78. Toward Optimal Stratification for Stratified Monte-Carlo Integration.

Paper Link】 【Pages】:28-36

【Authors】: Alexandra Carpentier ; Rémi Munos

【Abstract】: We consider the problem of adaptive stratified sampling for Monte Carlo integration of a function, given a finite number of function evaluations perturbed by noise. Here we address the problem of adapting simultaneously the number of samples into each stratum and the stratification itself. We show a tradeoff in the size of the partitioning. On the one hand it is important to refine the partition in areas where the observation noise or the function are heterogeneous in order to reduce this variability. But on the other hand, a too refined stratification makes it harder to assign the samples according to a near-optimal (oracle) allocation strategy. In this paper we provide an algorithm Monte-Carlo Upper-Lower Confidence Bound that selects online, among a large class of partitions, the partition that provides a near-optimal trade-off, and allocates the samples almost optimally on this partition.

【Keywords】:

79. A General Iterative Shrinkage and Thresholding Algorithm for Non-convex Regularized Optimization Problems.

Paper Link】 【Pages】:37-45

【Authors】: Pinghua Gong ; Changshui Zhang ; Zhaosong Lu ; Jianhua Huang ; Jieping Ye

【Abstract】: Non-convex sparsity-inducing penalties have recently received considerable attentions in sparse learning. Recent theoretical investigations have demonstrated their superiority over the convex counterparts in several sparse learning settings. However, solving the non-convex optimization problems associated with non-convex penalties remains a big challenge. A commonly used approach is the Multi-Stage (MS) convex relaxation (or DC programming), which relaxes the original non-convex problem to a sequence of convex problems. This approach is usually not very practical for large-scale problems because its computational cost is a multiple of solving a single convex problem. In this paper, we propose a General Iterative Shrinkage and Thresholding (GIST) algorithm to solve the nonconvex optimization problem for a large class of non-convex penalties. The GIST algorithm iteratively solves a proximal operator problem, which in turn has a closed-form solution for many commonly used penalties. At each outer iteration of the algorithm, we use a line search initialized by the Barzilai-Borwein (BB) rule that allows finding an appropriate step size quickly. The paper also presents a detailed convergence analysis of the GIST algorithm. The efficiency of the proposed algorithm is demonstrated by extensive experiments on large-scale data sets.

【Keywords】:

80. Thurstonian Boltzmann Machines: Learning from Multiple Inequalities.

Paper Link】 【Pages】:46-54

【Authors】: Truyen Tran ; Dinh Q. Phung ; Svetha Venkatesh

【Abstract】: We introduce Thurstonian Boltzmann Machines (TBM), a unified architecture that can naturally incorporate a wide range of data inputs at the same time. Our motivation rests in the Thurstonian view that many discrete data types can be considered as being generated from a subset of underlying latent continuous variables, and in the observation that each realisation of a discrete type imposes certain inequalities on those variables. Thus learning and inference in TBM reduce to making sense of a set of inequalities. Our proposed TBM naturally supports the following types: Gaussian, intervals, censored, binary, categorical, muticategorical, ordinal, (in)-complete rank with and without ties. We demonstrate the versatility and capacity of the proposed model on three applications of very different natures; namely handwritten digit recognition, collaborative filtering and complex social survey analysis.

【Keywords】:

81. A Variational Approximation for Topic Modeling of Hierarchical Corpora.

Paper Link】 【Pages】:55-63

【Authors】: Do-kyum Kim ; Geoffrey M. Voelker ; Lawrence K. Saul

【Abstract】: We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy. We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation. The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs). For these models we show that there exists a simple variational approximation for probabilistic inference. The approximation relies on a previously unexploited inequality that handles the conditional dependence between Dirichlet latent variables in adjacent levels of the model’s hierarchy. We compare our approach to existing implementations of nonparametric HDPs. On several benchmarks we find that our approach is faster than Gibbs sampling and able to learn more predictive models than existing variational methods. Finally, we demonstrate the large-scale viability of our approach on two newly available corpora from researchers in computer security–one with 350,000 documents and over 6,000 internal subcategories, the other with a five-level deep hierarchy.

【Keywords】:

82. Forecastable Component Analysis.

Paper Link】 【Pages】:64-72

【Authors】: Georg M. Goerg

【Abstract】: I introduce Forecastable Component Analysis (ForeCA), a novel dimension reduction technique for temporally dependent signals. Based on a new forecastability measure, ForeCA finds an optimal transformation to separate a multivariate time series into a forecastable and an orthogonal white noise space. I present a converging algorithm with a fast eigenvector solution. Applications to fi nancial and macro-economic time series show that ForeCA can successfully discover informative structure, which can be used for forecasting as well as classi cation. The R package ForeCA accompanies this work and is publicly available on CRAN.

【Keywords】:

83. Ellipsoidal Multiple Instance Learning.

Paper Link】 【Pages】:73-81

【Authors】: Gabriel Krummenacher ; Cheng Soon Ong ; Joachim M. Buhmann

【Abstract】: We propose a large margin method for asymmetric learning with ellipsoids, called eMIL, suited to multiple instance learning (MIL). We derive the distance between ellipsoids and the hyperplane, generalising the standard support vector machine. Negative bags in MIL contain only negative instances, and we treat them akin to uncertain observations in the robust optimisation framework. However, our method allows positive bags to cross the margin, since it is not known which instances within are positive. We show that representing bags as ellipsoids under the introduced distance is the most robust solution when treating a bag as a random variable with finite mean and covariance. Two algorithms are derived to solve the resulting non-convex optimization problem: a concave-convex procedure and a quasi-Newton method. Our method achieves competitive results on benchmark datasets. We introduce a MIL dataset from a real world application of detecting wheel defects from multiple partial observations, and show that eMIL outperforms competing approaches.

【Keywords】:

84. Local Low-Rank Matrix Approximation.

Paper Link】 【Pages】:82-90

【Authors】: Joonseok Lee ; Seungyeon Kim ; Guy Lebanon ; Yoram Singer

【Abstract】: Matrix approximation is a common tool in recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is of low-rank. We propose a new matrix approximation model where we assume instead that the matrix is locally of low-rank, leading to a representation of the observed matrix as a weighted sum of low-rank matrices. We analyze the accuracy of the proposed local low-rank modeling. Our experiments show improvements in prediction accuracy over classical approaches for recommendation tasks.

【Keywords】:

85. Generic Exploration and K-armed Voting Bandits.

Paper Link】 【Pages】:91-99

【Authors】: Tanguy Urvoy ; Fabrice Clérot ; Raphaël Feraud ; Sami Naamane

【Abstract】: We study a stochastic online learning scheme with partial feedback where the utility of decisions is only observable through an estimation of the environment parameters. We propose a generic pure-exploration algorithm, able to cope with various utility functions from multi-armed bandits settings to dueling bandits. The primary application of this setting is to offer a natural generalization of dueling bandits for situations where the environment parameters reflect the idiosyncratic preferences of a mixed crowd.

【Keywords】:

86. A unifying framework for vector-valued manifold regularization and multi-view learning.

Paper Link】 【Pages】:100-108

【Authors】: Ha Quang Minh ; Loris Bazzani ; Vittorio Murino

【Abstract】: This paper presents a general vector-valued reproducing kernel Hilbert spaces (RKHS) formulation for the problem of learning an unknown functional dependency between a structured input space and a structured output space, in the Semi-Supervised Learning setting. Our formulation includes as special cases Vector-valued Manifold Regularization and Multi-view Learning, thus provides in particular a unifying framework linking these two important learning approaches. In the case of least square loss function, we provide a closed form solution with an efficient implementation. Numerical experiments on challenging multi-class categorization problems show that our multi-view learning formulation achieves results which are comparable with state of the art and are significantly better than single-view learning.

【Keywords】:

87. Learning Connections in Financial Time Series.

Paper Link】 【Pages】:109-117

【Authors】: Gartheeban Ganeshapillai ; John V. Guttag ; Andrew Lo

【Abstract】: To reduce risk, investors seek assets that have high expected return and are unlikely to move in tandem. Correlation measures are generally used to quantify the connections between equities. The 2008 financial crisis, and its aftermath, demonstrated the need for a better way to quantify these connections. We present a machine learning-based method to build a connectedness matrix to address the shortcomings of correlation in capturing events such as large losses. Our method uses an unconstrained optimization to learn this matrix, while ensuring that the resulting matrix is positive semi-definite. We show that this matrix can be used to build portfolios that not only “beat the market,” but also outperform optimal (i.e., minimum variance) portfolios.

【Keywords】:

88. Fast dropout training.

Paper Link】 【Pages】:118-126

【Authors】: Sida I. Wang ; Christopher D. Manning

【Abstract】: Preventing feature co-adaptation by encouraging independent contributions from different features often improves classification and regression performance. Dropout training (Hinton et al., 2012) does this by randomly dropping out (zeroing) hidden units and input features during training of neural networks. However, repeatedly sampling a random subset of input features makes training much slower. Based on an examination of the implied objective function of dropout training, we show how to do fast dropout training by sampling from or integrating a Gaussian approximation, instead of doing Monte Carlo optimization of this objective. This approximation, justified by the central limit theorem and empirical evidence, gives an order of magnitude speedup and more stability. We show how to do fast dropout training for classification, regression, and multilayer neural networks. Beyond dropout, our technique is extended to integrate out other types of noise and small image transformations.

【Keywords】:

89. Scalable Optimization of Neighbor Embedding for Visualization.

Paper Link】 【Pages】:127-135

【Authors】: Zhirong Yang ; Jaakko Peltonen ; Samuel Kaski

【Abstract】: Neighbor embedding (NE) methods have found their use in data visualization but are limited in big data analysis tasks due to their O(n2) complexity for n data samples. We demonstrate that the obvious approach of subsampling produces inferior results and propose a generic approximated optimization technique that reduces the NE optimization cost to O(n log n). The technique is based on realizing that in visualization the embedding space is necessarily very low-dimensional (2D or 3D), and hence efficient approximations developed for n-body force calculations can be applied. In gradient-based NE algorithms the gradient for an individual point decomposes into “forces” exerted by the other points. The contributions of close-by points need to be computed individually but far-away points can be approximated by their “center of mass”, rapidly computable by applying a recursive decomposition of the visualization space into quadrants. The new algorithm brings a significant speed-up for medium-size data, and brings “big data” within reach of visualization.

【Keywords】:

90. Precision-recall space to correct external indices for biclustering.

Paper Link】 【Pages】:136-144

【Authors】: Blaise Hanczar ; Mohamed Nadif

【Abstract】: Biclustering is a major tool of data mining in many domains and many algorithms have emerged in recent years. All these algorithms aim to obtain coherent biclusters and it is crucial to have a reliable procedure for their validation. We point out the problem of size bias in biclustering evaluation and show how it can lead to wrong conclusions in a comparative study. We present the theoretical corrections for all of the most popular measures in order to remove this bias. We introduce the corrected precision-recall space that combines the advantages of corrected measures, the ease of interpretation and visualization of uncorrected measures. Numerical experiments demonstrate the interest of our approach.

【Keywords】:

91. Monochromatic Bi-Clustering.

Paper Link】 【Pages】:145-153

【Authors】: Sharon Wulff ; Ruth Urner ; Shai Ben-David

【Abstract】: We propose a natural cost function for the bi-clustering task, the monochromatic cost. This cost function is suitable for detecting meaningful homogeneous bi-clusters based on categorical valued input matrices. Such tasks arise in many applications, such as the analysis of social networks and in systems-biology where researchers try to infer functional grouping of biological agents based on their pairwise interactions. We analyze the computational complexity of the resulting optimization problem. We present a polynomial time approximation algorithm for this bi-clustering task and complement this result by showing that finding (exact) optimal solutions is NP-hard. As far as we know, these are the first positive approximation guarantees and formal NP-hardness results for any bi-clustering optimization problem. In addition, we show that our optimization problem can be efficiently solved by deterministic annealing, yielding a promising heuristic for large problem instances.

【Keywords】:

92. Gated Autoencoders with Tied Input Weights.

Paper Link】 【Pages】:154-162

【Authors】: Alain Droniou ; Olivier Sigaud

【Abstract】: The semantic interpretation of images is one of the core applications of deep learning. Several techniques have been recently proposed to model the relation between two images, with application to pose estimation, action recognition or invariant object recognition. Among these techniques, higher-order Boltzmann machines or relational autoencoders consider projections of the images on different subspaces and intermediate layers act as transformation specific detectors. In this work, we extend the mathematical study of (Memisevic, 2012b) to show that it is possible to use a unique projection for both images in a way that turns intermediate layers as spectrum encoders of transformations. We show that this results in networks that are easier to tune and have greater generalization capabilities.

【Keywords】:

93. Strict Monotonicity of Sum of Squares Error and Normalized Cut in the Lattice of Clusterings.

Paper Link】 【Pages】:163-171

【Authors】: Nicola Rebagliati

【Abstract】: Sum of Squares Error and Normalized Cut are two widely used clustering functional. It is known their minimum values are monotone with respect to the input number of clusters and this monotonicity does not allow for a simple automatic selection of a correct number of clusters. Here we study monotonicity not just on the minimizers but on the entire clustering lattice. We show the value of Sum of Squares Error is strictly monotone under the strict refinement relation of clusterings and we obtain data-dependent bounds on the difference between the value of a clustering and one of its refinements. Using analogous techniques we show the value of Normalized Cut is strictly anti-monotone. These results imply that even if we restrict our solutions to form a chain of clustering, like the one we get from hierarchical algorithms, we cannot rely on the functional values in order to choose the number of clusters. By using these results we get some data-dependent bounds on the difference of the values of any two clusterings.

【Keywords】:

94. Transition Matrix Estimation in High Dimensional Time Series.

Paper Link】 【Pages】:172-180

【Authors】: Fang Han ; Han Liu

【Abstract】: In this paper, we propose a new method in estimating transition matrices of high dimensional vector autoregressive (VAR) models. Here the data are assumed to come from a stationary Gaussian VAR time series. By formulating the problem as a linear program, we provide a new approach to conduct inference on such models. In theory, under a doubly asymptotic framework in which both the sample size T and dimensionality d of the time series can increase, we provide explicit rates of convergence between the estimator and the population transition matrix under different matrix norms. Our results show that the spectral norm of the transition matrix plays a pivotal role in determining the final rates of convergence. This is the first work analyzing the estimation of transition matrices under a high dimensional doubly asymptotic framework. Experiments are conducted on both synthetic and real-world stock data to demonstrate the effectiveness of the proposed method compared with the existing methods. The results of this paper have broad impact on different applications, including finance, genomics, and brain imaging.

【Keywords】:

95. Label Partitioning For Sublinear Ranking.

Paper Link】 【Pages】:181-189

【Authors】: Jason Weston ; Ameesh Makadia ; Hector Yee

【Abstract】: We consider the case of ranking a very large set of labels, items, or documents, which is common to information retrieval, recommendation, and large-scale annotation tasks. We present a general approach for converting an algorithm which has linear time in the size of the set to a sublinear one via label partitioning. Our method consists of learning an input partition and a label assignment to each partition of the space such that precision at k is optimized, which is the loss function of interest in this setting. Experiments on large-scale ranking and recommendation tasks show that our method not only makes the original linear time algorithm computationally tractable, but can also improve its performance.

【Keywords】:

96. Subproblem-Tree Calibration: A Unified Approach to Max-Product Message Passing.

Paper Link】 【Pages】:190-198

【Authors】: Huayan Wang ; Daphne Koller

【Abstract】: Max-product (max-sum) message passing algorithms are widely used for MAP inference in MRFs. It has many variants sharing a common flavor of passing