Machine Learning

  1. Multi-view predictive partitioning in high dimensions.

    Authors: Giovanni Montana, Brian McWilliams
    Subjects: Machine Learning
    Abstract

    Many modern data mining applications are concerned with the analysis of
    datasets in which the observations are described by paired high-dimensional
    vectorial representations or "views". Some typical examples can be found in web
    mining and genomics applications.

  2. Minimax Rates of Estimation for Sparse PCA in High Dimensions.

    Authors: Jing Lei, Vincent Q. Vu
    Subjects: Machine Learning
    Abstract

    We study sparse principal components analysis in the high-dimensional
    setting, where $p$ (the number of variables) can be much larger than $n$ (the
    number of observations). We prove optimal, non-asymptotic lower and upper
    bounds on the minimax estimation error for the leading eigenvector when it
    belongs to an $\ell_q$ ball for $q \in [0,1]$. Our bounds are sharp in $p$ and
    $n$ for all $q \in [0, 1]$ over a wide class of distributions. The upper bound
    is obtained by analyzing the performance of $\ell_q$-constrained PCA.

  3. Robust recovery of multiple subspaces by geometric l_p minimization.

    Authors: Teng Zhang, Gilad Lerman
    Subjects: Machine Learning
    Abstract

    We assume i.i.d. data sampled from a mixture distribution with K components
    along fixed d-dimensional linear subspaces and an additional outlier component.
    For p>0, we study the simultaneous recovery of the K fixed subspaces by
    minimizing the l_p-averaged distances of the sampled data points from any K
    subspaces. Under some conditions, we show that if $0<p\leq1$, then all
    underlying subspaces can be precisely recovered by l_p minimization with
    overwhelming probability.

  4. Adaptive Policies for Sequential Sampling under Incomplete Information and a Cost Constraint.

    Authors: Apostolos Burnetas, Odysseas Kanavetas
    Subjects: Machine Learning
    Abstract

    We consider the problem of sequential sampling from a finite number of
    independent statistical populations to maximize the expected infinite horizon
    average outcome per period, under a constraint that the expected average
    sampling cost does not exceed an upper bound. The outcome distributions are not
    known. We construct a class of consistent adaptive policies, under which the
    average outcome converges with probability 1 to the true value under complete
    information for all distributions with finite means.

  5. Spike-and-Slab Sparse Coding for Unsupervised Feature Discovery.

    Authors: Aaron Courville, Yoshua Bengio, Ian J. Goodfellow
    Subjects: Machine Learning
    Abstract

    We consider the problem of using a factor model we call {\em spike-and-slab
    sparse coding} (S3C) to learn features for a classification task. The S3C model
    resembles both the spike-and-slab RBM and sparse coding. Since exact inference
    in this model is intractable, we derive a structured variational inference
    procedure and employ a variational EM training algorithm. Prior work on
    approximate inference for this model has not prioritized the ability to exploit
    parallel architectures and scale to enormous problem sizes.

  6. A Split-Merge MCMC Algorithm for the Hierarchical Dirichlet Process.

    Authors: David M. Blei, Chong Wang
    Subjects: Machine Learning
    Abstract

    The hierarchical Dirichlet process (HDP) has become an important Bayesian
    nonparametric model for grouped data, such as document collections. The HDP is
    used to construct a flexible mixed-membership model where the number of
    components is determined by the data. As for most Bayesian nonparametric
    models, exact posterior inference is intractable---practitioners use Markov
    chain Monte Carlo (MCMC) or variational inference.

  7. The Interaction of Entropy-Based Discretization and Sample Size: An Empirical Study.

    Authors: Casey Bennett
    Subjects: Machine Learning
    Abstract

    An empirical investigation of the interaction of sample size and
    discretization - in this case the entropy-based method CAIM (Class-Attribute
    Interdependence Maximization) - was undertaken to evaluate the impact and
    potential bias introduced into data mining performance metrics due to variation
    in sample size as it impacts the discretization process. Of particular interest
    was the effect of discretizing within cross-validation folds averse to outside
    discretization folds.

  8. Sparse Nonparametric Graphical Models.

    Authors: Larry Wasserman, Han Liu, John Lafferty
    Subjects: Machine Learning
    Abstract

    We present some nonparametric methods for graphical modeling. In the discrete
    case, where the data are binary or drawn from a finite alphabet, Markov random
    fields are already essentially nonparametric, since the cliques can take only a
    finite number of values. Continuous data are different. The Gaussian graphical
    model is the standard parametric model for continuous data, but it makes
    distributional assumptions that are often unrealistic. We discuss two
    approaches to building more flexible graphical models.

  9. Bayesian Active Learning for Classification and Preference Learning.

    Authors: Zoubin Ghahramani, Neil Houlsby, Ferenc Husz&#xe1;r, M&#xe1;t&#xe9; Lengyel
    Subjects: Machine Learning
    Abstract

    Information theoretic active learning has been widely studied for
    probabilistic models. For simple regression an optimal myopic policy is easily
    tractable. However, for other tasks and with more complex models, such as
    classification with nonparametric models, the optimal solution is harder to
    compute. Current approaches make approximations to achieve tractability. We
    propose an approach that expresses information gain in terms of predictive
    entropies, and apply this method to the Gaussian Process Classifier (GPC).

  10. Bilateral Random Projections.

    Authors: Tianyi Zhou, Dacheng Tao
    Subjects: Machine Learning
    Abstract

    Low-rank structure have been profoundly studied in data mining and machine
    learning. In this paper, we show a dense matrix $X$'s low-rank approximation
    can be rapidly built from its left and right random projections $Y_1=XA_1$ and
    $Y_2=X^TA_2$, or bilateral random projection (BRP). We then show power scheme
    can further improve the precision. The deterministic, average and deviation
    bounds of the proposed method and its power scheme modification are proved
    theoretically.

  11. A Novel M-Estimator for Robust PCA.

    Authors: Teng Zhang, Gilad Lerman
    Subjects: Machine Learning
    Abstract

    We formulate a convex minimization to robustly recover a subspace from a
    contaminated data set, partially sampled around it, and propose a fast
    iterative algorithm to achieve the corresponding minimum. We establish exact
    recovery by this minimizer, quantify the effect of noise and regularization,
    explain how to take advantage of a known intrinsic dimension and establish
    linear convergence of the iterative algorithm.

  12. Reduced Rank Multivariate Generalized Linear Models for Feature Extraction.

    Authors: Yiyuan She
    Subjects: Machine Learning
    Abstract

    Supervised linear feature extraction corresponds to fitting a reduced rank
    multivariate model. This paper studies rank penalized and rank constrained
    multivariate generalized linear models. From the perspective of thresholding
    rules, we build a framework for fitting singular value penalized models and use
    it for feature extraction. Through solving the rank constraint form of the
    problem, we propose progressive feature space reduction for fast computation in
    high dimensions with little performance loss.

  13. Beta-Negative Binomial Process and Poisson Factor Analysis.

    Authors: David Dunson, Lawrence Carin, Mingyuan Zhou, Lauren Hannah
    Subjects: Machine Learning
    Abstract

    A beta-negative binomial (BNB) process is proposed, leading to a
    beta-gamma-Poisson process, which may be viewed as a "multi-scoop"
    generalization of the beta-Bernoulli process. The BNB process is augmented into
    a beta-gamma-gamma-Poisson hierarchical structure, and applied as a
    nonparametric Bayesian prior for an infinite Poisson factor analysis model. A
    finite approximation for the beta process Levy random measure is constructed
    for convenient implementation. Efficient MCMC computations are performed with
    data augmentation and marginalization techniques.

  14. Robust Learning via Cause-Effect Models.

    Authors: Bernhard Sch&#xf6;lkopf, Dominik Janzing, Jonas Peters, Kun Zhang
    Subjects: Machine Learning
    Abstract

    We consider the problem of function estimation in the case where the data
    distribution may shift between training and test time, and additional
    information about it may be available at test time. This relates to popular
    scenarios such as covariate shift, concept drift, transfer learning and
    semi-supervised learning. This working paper discusses how these tasks could be
    tackled depending on the kind of changes of the distributions. It argues that
    knowledge of an underlying causal direction can facilitate several of these
    tasks.

  15. On a Rapid Simulation of the Dirichlet Process.

    Authors: Luai Al Labadi, Mahmoud Zarepour
    Subjects: Machine Learning
    Abstract

    We describe a simple and efficient procedure for approximating the L\'evy
    measure of a $\text{Gamma}(\alpha,1)$ random variable. We use this
    approximation to derive a finite sum-representation that converges almost
    surely to Ferguson's representation of the Dirichlet process based on arrivals
    of a homogeneous Poisson process. We compare the efficiency of our
    approximation to several other well known approximations of the Dirichlet
    process and demonstrate a substantial improvement.

  16. Truncated Power Method for Sparse Eigenvalue Problems.

    Authors: Tong Zhang, Xiao-Tong Yuan
    Subjects: Machine Learning
    Abstract

    This paper considers the sparse eigenvalue problem, which is to extract
    dominant (largest) sparse eigenvectors with at most $k$ non-zero components. We
    propose a simple yet effective solution called truncated power method that can
    approximately solve the underlying nonconvex optimization problem. A strong
    sparse recovery result is proved for the truncated power method, and this
    theory is our key motivation for developing the new algorithm. The proposed
    method is tested on applications such as sparse principal component analysis
    and the densest $k$-subgraph problem.

  17. Convergent Expectation Propagation in Linear Models with Spike-and-slab Priors.

    Authors: Jos&#xe9; Miguel Hern&#xe1;ndez-Lobato, Daniel Hern&#xe1;ndez-Lobato
    Subjects: Machine Learning
    Abstract

    Exact inference in the linear regression model with spike and slab priors is
    often intractable. Expectation propagation (EP) can be used for approximate
    inference. However, the regular sequential form of EP (R-EP) may fail to
    converge in this model when the size of the training set is very small. As an
    alternative, we propose a provably convergent EP algorithm (PC-EP).

  18. Adaptive Forgetting Factor Fictitious Play.

    Authors: David S. Leslie, Michalis Smyrnakis
    Subjects: Machine Learning
    Abstract

    It is now well known that decentralised optimisation can be formulated as a
    potential game, and game-theoretical learning algorithms can be used to find an
    optimum. One of the most common learning techniques in game theory is
    fictitious play. However fictitious play is founded on an implicit assumption
    that opponents' strategies are stationary. We present a novel variation of
    fictitious play that allows the use of a more realistic model of opponent
    strategy.

  19. Graph Construction for Learning with Unbalanced Data.

    Authors: Manqi Zhao, Venkatesh Saligrama, Jing Qian
    Subjects: Machine Learning
    Abstract

    Unbalanced data arises in many learning tasks such as clustering of
    multi-class data, hierarchical divisive clustering and semisupervised learning.
    Graph-based approaches are popular tools for these problems. Graph construction
    is an important aspect of graph-based learning. We show that graph-based
    algorithms can fail for unbalanced data for many popular graphs such as k-NN,
    \epsilon-neighborhood and full-RBF graphs. We propose a novel graph
    construction technique that encodes global statistical information into node
    degrees through a ranking scheme.

  20. Asynchronous Stochastic Approximation with Differential Inclusions.

    Authors: Steven Perkins, David S. Leslie
    Subjects: Machine Learning
    Abstract

    The asymptotic pseudo-trajectory approach to stochastic approximation of
    Benaim, Hofbauer and Sorin is extended for asynchronous stochastic
    approximations with a set-valued mean field. The asynchronicity of the process
    is incorporated into the mean field to produce convergence results which remain
    similar to those of an equivalent synchronous process. In addition, this allows
    many of the restrictive assumptions previously associated with asynchronous
    stochastic approximation to be removed.

  21. Entropy Search for Information-Efficient Global Optimization.

    Authors: Philipp Hennig, Christian J. Schuler
    Subjects: Machine Learning
    Abstract

    Contemporary global optimization algorithms are based on local measures of
    utility, rather than a probability measure over location and value of the
    optimum. They thus attempt to collect low function values, not to learn about
    the optimum. The reason for the absence of probabilistic global optimizers is
    that the corresponding inference problem is intractable in several ways.

  22. Learning Nonlinear Functions Using Regularized Greedy Forest.

    Authors: Tong Zhang, Rie Johnson
    Subjects: Machine Learning
    Abstract

    We apply the concept of structured sparsity to improve boosted decision trees
    with general loss functions. The existing approach to this problem is
    Friedman's gradient boosting procedure using a regression tree base learner.
    Although this method has led to many successful industrial applications, it
    suffers from several theoretical and practical drawbacks. By employing the idea
    of structured greedy search, we are able to design a regularized greedy forest
    procedure to address these issues.

  23. Machine Learning with Operational Costs.

    Authors: Theja Tulabandhula, Cynthia Rudin
    Subjects: Machine Learning
    Abstract

    This work concerns the way that statistical models are used to make
    decisions. In particular, we aim to merge the way estimation algorithms are
    designed with how they are used for a subsequent task. Our methodology
    considers the operational cost of carrying out a policy, based on a predictive
    model. The operational cost becomes a regularization term in the learning
    algorithm's objective function, allowing either an \textit{optimistic} or
    \textit{pessimistic} view of possible costs.

  24. Information-Maximization Clustering based on Squared-Loss Mutual Information.

    Authors: Masashi Sugiyama, Makoto Yamada, Hirotaka Hachiya, Manabu Kimura
    Subjects: Machine Learning
    Abstract

    Information-maximization clustering learns a probabilistic classifier in an
    unsupervised manner so that mutual information between feature vectors and
    cluster assignments is maximized. A notable advantage of this approach is that
    it only involves continuous optimization of model parameters, which is
    substantially easier to solve than discrete optimization of cluster
    assignments. However, existing methods still involve non-convex optimization
    problems, and therefore finding a good local optimal solution is not
    straightforward in practice.

  25. Mask Iterative Hard Thresholding Algorithms for Sparse Image Reconstruction of Objects with Known Contour.

    Authors: Kun Qiu, Aleksandar Dogandzic, Renliang Gu
    Subjects: Machine Learning
    Abstract

    We develop mask iterative hard thresholding algorithms (mask IHT and mask
    DORE) for sparse image reconstruction of objects with known contour. The
    measurements follow a noisy underdetermined linear model common in the
    compressive sampling literature. Assuming that the contour of the object that
    we wish to reconstruct is known and that the signal outside the contour is
    zero, we formulate a constrained residual squared error minimization problem
    that incorporates both the geometric information (i.e. the knowledge of the
    object's contour) and the signal sparsity constraint.

  26. Joint estimation of linear non-Gaussian acyclic models.

    Authors: Shohei Shimizu
    Subjects: Machine Learning
    Abstract

    A linear non-Gaussian structural equation model called LiNGAM is an
    identifiable model for exploratory causal analysis. Previous methods estimate a
    causal ordering of variables and their connection strengths based on a single
    dataset. However, in many application domains, data are obtained under
    different conditions, that is, multiple datasets are obtained rather than a
    single dataset.

  27. Structure Learning of Probabilistic Graphical Models: A Comprehensive Survey.

    Authors: Yang Zhou
    Subjects: Machine Learning
    Abstract

    Probabilistic graphical models combine the graph theory and probability
    theory to give a multivariate statistical modeling. They provide a unified
    description of uncertainty using probability and complexity using the graphical
    model. Especially, graphical models provide the following several useful
    properties:

    - Graphical models provide a simple and intuitive interpretation of the
    structures of probabilistic models. On the other hand, they can be used to
    design and motivate new models.

  28. Efficient Adaptive Compressive Sensing Using Sparse Hierarchical Learned Dictionaries.

    Authors: Jarvis Haupt, Akshay Soni
    Subjects: Machine Learning
    Abstract

    Recent breakthrough results in compressed sensing (CS) have established that
    many high dimensional objects can be accurately recovered from a relatively
    small number of non- adaptive linear projection observations, provided that the
    objects possess a sparse representation in some basis.

  29. Approximate Gaussian Integration using Expectation Propagation.

    Authors: Philipp Hennig, Simon Lacoste-Julien, John P. Cunningham
    Subjects: Machine Learning
    Abstract

    While Gaussian probability densities are omnipresent in applied mathematics,
    Gaussian cumulative probabilities are hard to calculate in any but the
    univariate case. We offer here an empirical study of the utility of Expectation
    Propagation (EP) as an approximate integration method for this problem. For
    rectangular integration regions, the approximation is highly accurate. We also
    extend the derivations to the more general case of polyhedral integration
    regions. However, we find that in this polyhedral case, EP's answer, though
    often accurate, can be almost arbitrarily wrong.

  30. A kernel-based framework for learning graded relations from data.

    Authors: Tapio Pahikkala, Antti Airola, Tapio Salakoski, Willem Waegeman, Michiel Stock, Bernard De Baets
    Subjects: Machine Learning
    Abstract

    Driven by a large number of potential applications in areas like
    bioinformatics, information retrieval and social network analysis, the problem
    setting of inferring relations between pairs of data objects has recently been
    investigated quite intensively in the machine learning community. To this end,
    current approaches typically consider datasets containing crisp relations, so
    that standard classification methods can be adopted. However, relations between
    objects like similarities and preferences are often expressed in a graded
    manner in real-world applications.

  31. Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm.

    Authors: Fionn Murtagh, Pierre Legendre
    Subjects: Machine Learning
    Abstract

    The Ward error sum of squares hierarchical clustering method has been very
    widely used since its first description by Ward in a 1963 publication. It has
    also been generalized in various ways. However there are different
    interpretations in the literature and there are different implementations of
    the Ward agglomerative algorithm in commonly used software systems, including
    differing expressions of the agglomerative criterion. Our survey work and case
    studies will be useful for all those involved in developing software for data
    analysis using Ward's hierarchical clustering method.

  32. Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement.

    Authors: Fionn Murtagh, Pedro Contreras
    Subjects: Machine Learning
    Abstract

    We describe many vantage points on the Baire metric and its use in clustering
    data, or its use in preprocessing and structuring data in order to support
    search and retrieval operations. In some cases, we proceed directly to clusters
    and do not directly determine the distances. We show how a hierarchical
    clustering can be read directly from one pass through the data. We offer
    insights also on practical implications of precision of data measurement. As a
    mechanism for treating multidimensional data, including very high dimensional
    data, we use random projections.

  33. Additive Covariance Kernels for High-Dimensional Gaussian Process Modeling.

    Authors: Laurent Carraro, David Ginsbourger, Nicolas Durrande, Olivier Roustant
    Subjects: Machine Learning
    Abstract

    Gaussian process models -also called Kriging models- are often used as
    mathematical approximations of expensive experiments. However, the number of
    observation required for building an emulator becomes unrealistic when using
    classical covariance kernels when the dimension of input increases. In oder to
    get round the curse of dimensionality, a popular approach is to consider
    simplified models such as additive models.

  34. Optimal exponential bounds on the accuracy of classification.

    Authors: N.I. Pentacaput
    Subjects: Machine Learning
    Abstract

    We consider a standard binary classification problem. The performance of any
    binary classifier based on the training data is characterized by the excess
    risk. We study Bahadur's type exponential bounds on the minimax accuracy
    confidence function based on the excess risk. We study how this quantity
    depends on the complexity of the class of distributions characterized by
    exponents of entropies of the class of regression functions or of the class of
    Bayes classifiers corresponding to the distributions from the class.

  35. Automatic Relevance Determination in Nonnegative Matrix Factorization with the beta-Divergence.

    Authors: Vincent Y. F. Tan, C&#xe9;dric F&#xe9;votte
    Subjects: Machine Learning
    Abstract

    This paper addresses the problem of estimating the latent dimensionality in
    nonnegative matrix factorization (NMF). The estimation is done via automatic
    relevance determination (ARD). Uncovering the model order is important as it is
    necessary to strike the right balance between data fidelity and overfitting. We
    propose a Bayesian model for NMF and two families of algorithms known as l1-ARD
    and l2-ARD, each assuming different priors on the basis elements and the
    activation coefficients.

  36. On l_1 Mean and Variance Filtering.

    Authors: Bo Wahlberg, Cristian R. Rojas, Mariette Annergren
    Subjects: Machine Learning
    Abstract

    This paper addresses the problem of segmenting a time-series with respect to
    changes in the mean value or in the variance. The first case is when the time
    data is modeled as a sequence of independent and normal distributed random
    variables with unknown, possibly changing, mean value but fixed variance. The
    main assumption is that the mean value is piecewise constant in time, and the
    task is to estimate the change times and the mean values within the segments.
    The second case is when the mean value is constant, but the variance can
    change.

  37. Receiver Architectures for MIMO-OFDM Based on a Combined VMP-SP Algorithm.

    Authors: Carles Navarro Manch&#xf3;n, Gunvor E. Kirkelund, Erwin Riegler, Lars P. B. Christensen, Bernard H. Fleury
    Subjects: Machine Learning
    Abstract

    Iterative information processing, either based on heuristics or analytical
    frameworks, has been shown to be a very powerful tool for the design of
    efficient, yet feasible, wireless receiver architectures. Within this context,
    algorithms performing message-passing on a probabilistic graph, such as the
    sum-product (SP) and variational message passing (VMP) algorithms, have become
    increasingly popular.

  38. Falsification and future performance.

    Authors: David Balduzzi
    Subjects: Machine Learning
    Abstract

    We information-theoretically reformulate two measures of capacity from
    statistical learning theory: empirical VC-entropy and empirical Rademacher
    complexity. We show these capacity measures count the number of hypotheses
    about a dataset that a learning algorithm falsifies when it finds the
    classifier in its repertoire minimizing empirical risk. It then follows from
    that the future performance of predictors on unseen data is controlled in part
    by how many hypotheses the learner falsifies.

  39. The Graphical Lasso: New Insights and Alternatives.

    Authors: Trevor Hastie, Rahul Mazumder
    Subjects: Machine Learning
    Abstract

    The graphical lasso [Banerjee et al., 2008, Friedman et al., 2007b] is a
    popular approach for learning the structure in an undirected Gaussian graphical
    model, using $\ell_1$ regularization to control the number of zeros in the
    precision matrix ?${\B\Theta}={\B\Sigma}^{-1}$. The R package glasso is
    popular, fast, and allows one to efficiently build a path of models for
    different values of the tuning parameter. Convergence of GLASSO can be tricky;
    the converged precision matrix might not be the inverse of the estimated
    covariance, and occasionally it fails to converge with warm starts.

  40. Krylov Subspace Descent for Deep Learning.

    Authors: Oriol Vinyals, Daniel Povey
    Subjects: Machine Learning
    Abstract

    In this paper, we propose a second order optimization method to learn models
    where both the dimensionality of the parameter space and the number of training
    samples is high. In our method, we construct on each iteration a Krylov
    subspace formed by the gradient and an approximation to the Hessian matrix, and
    then use a subset of the training data samples to optimize over this subspace.
    As with the Hessian Free (HF) method of [7], the Hessian matrix is never
    explicitly constructed, and is computed using a subset of data.

  41. Fast Learning Rate of Non-Sparse Multiple Kernel Learning and Optimal Regularization Strategies.

    Authors: Taiji Suzuki
    Subjects: Machine Learning
    Abstract

    In this paper, we give a new generalization error bound of Multiple Kernel
    Learning (MKL) for a general class of regularizations, and discuss what kind of
    regularization gives a favorable predictive accuracy. Our main target in this
    paper is dense type regularizations including \ellp-MKL. According to the
    recent numerical experiments, the sparse regularization does not necessarily
    show a good performance compared with dense type regularizations.

  42. Maximum Joint Entropy and Information-Based Collaboration of Automated Learning Machines.

    Authors: N. K. Malakar, K. H. Knuth, D. J. Lary
    Subjects: Machine Learning
    Abstract

    We are working to develop automated intelligent agents, which can act and
    react as learning machines with minimal human intervention. To accomplish this,
    an intelligent agent is viewed as a question-asking machine, which is designed
    by coupling the processes of inference and inquiry to form a model-based
    learning unit. In order to select maximally-informative queries, the
    intelligent agent needs to be able to compute the relevance of a question.

  43. Estimated VC dimension for risk bounds.

    Authors: Cosma Rohilla Shalizi, Daniel J. McDonald, Mark Schervish
    Subjects: Machine Learning
    Abstract

    Vapnik-Chervonenkis (VC) dimension is a fundamental measure of the
    generalization capacity of learning algorithms. However, apart from a few
    special cases, it is hard or impossible to calculate analytically. Vapnik et
    al. [10] proposed a technique for estimating the VC dimension empirically.
    While their approach behaves well in simulations, it could not be used to bound
    the generalization risk of classifiers, because there were no bounds for the
    estimation error of the VC dimension itself.

  44. A note on the lack of symmetry in the graphical lasso.

    Authors: Bala Rajaratnam, Benjamin T. Rolfs
    Subjects: Machine Learning
    Abstract

    The graphical lasso (glasso) is a widely-used fast algorithm for estimating
    sparse inverse covariance matrices. The glasso solves an L_1 penalized maximum
    likelihood problem and is implemented on CRAN. The output from the glasso, a
    regularized covariance matrix estimate Sigma_glasso and a sparse inverse
    covariance matrix estimate Omega_glasso, not only identify a graphical model
    but can also serve as intermediate inputs into multivariate procedures such as
    PCA, LDA, MANOVA, and others.

  45. The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy.

    Authors: Nancy Heckman
    Subjects: Machine Learning
    Abstract

    The popular cubic smoothing spline estimate of a regression function arises
    as the minimizer of the penalized sum of squares $\sum_j(Y_j - {\mu}(t_j))^2 +
    {\lambda}\int_a^b [{\mu}"(t)]^2 dt$, where the data are $t_j,Y_j$, $j=1,...,
    n$. The minimization is taken over an infinite-dimensional function space, the
    space of all functions with square integrable second derivatives. But the
    calculations can be carried out in a finite-dimensional space.

  46. Estimation of scale functions to model heteroscedasticity by support vector machines.

    Authors: Robert Hable, Andreas Christmann
    Subjects: Machine Learning
    Abstract

    A main goal of regression is to derive statistical conclusions on the
    conditional distribution of the output variable Y given the input values x. Two
    of the most important characteristics of a single distribution are location and
    scale. Support vector machines (SVMs) are well established to estimate location
    functions like the conditional median or the conditional mean. We investigate
    the estimation of scale functions by SVMs when the conditional median is
    unknown, too. Estimation of scale functions is important e.g. to estimate the
    volatility in finance.

  47. Robust PCA as Bilinear Decomposition with Outlier-Sparsity Regularization.

    Authors: Georgios B. Giannakis, Gonzalo Mateos
    Subjects: Machine Learning
    Abstract

    Principal component analysis (PCA) is widely used for dimensionality
    reduction, with well-documented merits in various applications involving
    high-dimensional data, including computer vision, preference measurement, and
    bioinformatics. In this context, the fresh look advocated here permeates
    benefits from variable selection and compressive sampling, to robustify PCA
    against outliers.

  48. UPAL: Unbiased Pool Based Active Learning.

    Authors: Ravi Ganti, Alexander Gray
    Subjects: Machine Learning
    Abstract

    In this paper we address the problem of pool based active learning, and
    provide an algorithm, called UPAL, that works by minimizing the unbiased
    estimator of the risk of a hypothesis in a given hypothesis space. For the
    space of linear classifiers and the squared loss we show that UPAL is
    equivalent to an exponentially weighted average forecaster. Exploiting some
    recent results regarding the spectra of random matrices allows us to establish
    consistency of UPAL when the true hypothesis is a linear hypothesis.

  49. Discriminant Analysis with Adaptively Pooled Covariance.

    Authors: Noah Simon, Rob Tibshirani
    Subjects: Machine Learning
    Abstract

    Linear and Quadratic Discriminant analysis (LDA/QDA) are common tools for
    classification problems. For these methods we assume observations are normally
    distributed within group. We estimate a mean and covariance matrix for each
    group and classify using Bayes theorem. With LDA, we estimate a single, pooled
    covariance matrix, while for QDA we estimate a separate covariance matrix for
    each group. Rarely do we believe in a homogeneous covariance structure between
    groups, but often there is insufficient data to separately estimate covariance
    matrices.

  50. Bayesian Causal Induction.

    Authors: Pedro A. Ortega
    Subjects: Machine Learning
    Abstract

    Discovering causal relationships is a hard task, often hindered by the need
    for intervention, and often requiring large amounts of data to resolve
    statistical uncertainty. However, humans quickly arrive at useful causal
    relationships. One possible reason is that humans use strong prior knowledge;
    and rather than encoding hard causal relationships, they encode beliefs over
    causal structures, allowing for sound generalization from the observations they
    obtain from directly acting in the world.

  51. Structural Similarity and Distance in Learning.

    Authors: Venkatesh Saligrama, Joseph Wang, David A. Casta&#xf1;&#xf3;n
    Subjects: Machine Learning
    Abstract

    We propose a novel method of introducing structure into existing machine
    learning techniques by developing structure-based similarity and distance
    measures. To learn structural information, low-dimensional structure of the
    data is captured by solving a non-linear, low-rank representation problem. We
    show that this low-rank representation can be kernelized, has a closed-form
    solution, allows for separation of independent manifolds, and is robust to
    noise.

  52. Context tree selection and linguistic rhythm retrieval from written texts.

    Authors: Jesus E. Garcia, Antonio Galves, Nancy L. Garcia, Florencia Leonardi, Charlotte Galves
    Subjects: Machine Learning
    Abstract

    The starting point of this article is the question "How to retrieve
    fingerprints of rhythm in written texts?". We address this problem in the case
    of Brazilian and European Portuguese. These two dialects of Modern Portuguese
    share the same lexicon and most of the sentences they produce are superficially
    identical. Yet they are conjectured, on linguistic grounds, to implement
    different rhythms. We show that this linguistic question can be formulated as a
    problem of model selection in the class of variable length Markov chains.

  53. A Flexible, Scalable and Efficient Algorithmic Framework for Primal Graphical Lasso.

    Authors: Rahul Mazumder, Deepak K. Agarwal
    Subjects: Machine Learning
    Abstract

    We propose a scalable, efficient and statistically motivated computational
    framework for Graphical Lasso (Friedman et al., 2007b) - a covariance
    regularization framework that has received significant attention in the
    statistics community over the past few years. Existing algorithms have trouble
    in scaling to dimensions larger than a thousand. Our proposal significantly
    enhances the state-of-the-art for such moderate sized problems and gracefully
    scales to larger problems where other algorithms become practically infeasible.
    This requires a few key new ideas.

  54. Distance Dependent Infinite Latent Feature Models.

    Authors: David M. Blei, Peter I. Frazier, Samuel J. Gershman
    Subjects: Machine Learning
    Abstract

    Latent feature models are widely used to decompose data into a small number
    of components. Bayesian nonparametric variants of these models, which use the
    Indian buffet process (IBP) as a prior over latent features, allow the number
    of features to be determined from the data. We present a generalization of the
    IBP, the distance dependent Indian buffet process (dd-IBP), for modeling
    non-exchangeable data. It relies on a distance function defined between data
    points, biasing nearby data to share more features.

  55. Quilting Stochastic Kronecker Product Graphs to Generate Multiplicative Attribute Graphs.

    Authors: Hyokun Yun, S. V. N. Vishwanathan
    Subjects: Machine Learning
    Abstract

    We describe the first sub-quadratic sampling algorithm for the Multiplicative
    Attribute Graph Model (MAGM) of Kim and Leskovec (2010). We exploit the close
    connection between MAGM and the Kronecker Product Graph Model (KPGM) of
    Leskovec et al. (2010), and show that to sample a graph from a MAGM it suffices
    to sample small number of KPGM graphs and \emph{quilt} them together. Under
    mild technical conditions our algorithm runs in $O((\log_2(n))^3 |E|)$ time,
    where $n$ is the number of nodes and $|E|$ is the number of edges in the
    sampled graph.

  56. Algebraic Representation of Probability Distributions.

    Authors: Klaus-Robert M&#xfc;ller, Franz Johannes Kir&#xe1;ly, Paul von B&#xfc;nau, Jan Saputra M&#xfc;ller, Duncan Blythe, Frank Meinecke
    Subjects: Machine Learning
    Abstract

    We show that the use of techniques from algebra and algebraic geometry can be
    highly beneficial for tackling machine learning problems, where the set of
    desired solutions can be described in terms of approximate polynomial
    equations. Namely, they allow one to directly solve learning problems using
    algebraic operations thus avoiding iterative optimization which may converge to
    local minima. In this spirit, we suggest a novel representation for probability
    distributions in terms of elements in the polynomial ring derived from
    estimated cumulants.

  57. Gaussian Process Regression Networks.

    Authors: Zoubin Ghahramani, Andrew Gordon Wilson, David A. Knowles
    Subjects: Machine Learning
    Abstract

    We introduce a new regression framework, Gaussian process regression networks
    (GPRN), which combines the structural properties of Bayesian neural networks
    with the non-parametric flexibility of Gaussian processes. This model
    accommodates input dependent signal and noise correlations between multiple
    response variables, input dependent length-scales and amplitudes, and
    heavy-tailed predictive distributions. We derive both efficient Markov chain
    Monte Carlo and variational Bayes inference procedures for this model.

  58. Bayesian Group Factor Analysis.

    Authors: Samuel Kaski, Seppo Virtanen, Arto Klami, Suleiman A. Khan
    Subjects: Machine Learning
    Abstract

    We introduce a factor analysis model that summarizes the dependencies between
    observed variable groups, instead of dependencies between individual variables
    as standard factor analysis does. A group may correspond to one view of the
    same set of objects, one of many data sets tied by co-occurrence, or a set of
    alternative variables collected from statistics tables to measure one property
    of interest.

  59. Efficient Latent Variable Graphical Model Selection via Split Bregman Method.

    Authors: Gui-Bo Ye, Xiaohui Xie, Yuanfeng Wang, Yifei Chen
    Subjects: Machine Learning
    Abstract

    We consider the problem of covariance matrix estimation in the presence of
    latent variables. Under suitable conditions, it is possible to learn the
    marginal covariance matrix of the observed variables via a tractable convex
    program, where the concentration matrix of the observed variables is decomposed
    into a sparse matrix (representing the graphical structure of the observed
    variables) and a low rank matrix (representing the marginalization effect of
    latent variables). We present an efficient first-order method based on split
    Bregman to solve the convex problem.

  60. Discovering Emerging Topics in Social Streams via Link Anomaly Detection.

    Authors: Ryota Tomioka, Toshimitsu Takahashi, Kenji Yamanishi
    Subjects: Machine Learning
    Abstract

    Detection of emerging topics are now receiving renewed interest motivated by
    the rapid growth of social networks. Conventional term-frequency-based
    approaches may not be appropriate in this context, because the information
    exchanged are not only texts but also images, URLs, and videos. We focus on the
    social aspects of theses networks. That is, the links between users that are
    generated dynamically intentionally or unintentionally through replies,
    mentions, and retweets.

  61. The Generalization Ability of Online Algorithms for Dependent Data.

    Authors: Alekh Agarwal, John C. Duchi
    Subjects: Machine Learning
    Abstract

    We study the generalization performance of arbitrary online learning
    algorithms trained on samples coming from a dependent source of data. We show
    that the generalization error of any stable online algorithm concentrates
    around its regret--an easily computable statistic of the online performance of
    the algorithm--when the underlying ergodic process is $\beta$- or
    $\phi$-mixing.

  62. Ground Metric Learning.

    Authors: Marco Cuturi, David Avis
    Subjects: Machine Learning
    Abstract

    Transportation distances have been used for more than a decade now in machine
    learning to compare histograms of features. They have one parameter: the ground
    metric, which can be any metric between the features themselves. As is the case
    for all parameterized distances, transportation distances can only prove useful
    in practice when this parameter is carefully chosen. To date, the only option
    available to practitioners to set the ground metric parameter was to rely on a
    priori knowledge of the features, which limited considerably the scope of
    application of transportation distances.

  63. On the trade-off between complexity and correlation decay in structural learning algorithms.

    Authors: Andrea Montanari, Jos&#xe9; Bento
    Subjects: Machine Learning
    Abstract

    We consider the problem of learning the structure of Ising models (pairwise
    binary Markov random fields) from i.i.d. samples. While several methods have
    been proposed to accomplish this task, their relative merits and limitations
    remain somewhat obscure. By analyzing a number of concrete examples, we show
    that low-complexity algorithms often fail when the Markov random field develops
    long-range correlations. More precisely, this phenomenon appears to be related
    to the Ising model phase transition (although it does not coincide with it).

  64. Identifying relationships between drugs and medical conditions: winning experience in the Challenge 2 of the OMOP 2010 Cup.

    Authors: Vladimir Nikulin
    Subjects: Machine Learning
    Abstract

    There is a growing interest in using a longitudinal observational databases
    to detect drug safety signal. In this paper we present a novel method, which we
    used online during the OMOP Cup. We consider homogeneous ensembling, which is
    based on random re-sampling (known, also, as bagging) as a main innovation
    compared to the previous publications in the related field. This study is based
    on a very large simulated database of the 10 million patients records, which
    was created by the Observational Medical Outcomes Partnership (OMOP).

  65. The Discrete Infinite Logistic Normal Distribution.

    Authors: John Paisley, Chong Wang, David Blei
    Subjects: Machine Learning
    Abstract

    We present the discrete infinite logistic normal distribution (DILN), a
    Bayesian nonparametric prior for mixed membership models. DILN is a
    generalization of the hierarchical Dirichlet process (HDP) that models
    correlation structure between the weights of the atoms at the group level. We
    derive a representation of DILN as a normalized collection of gamma-distributed
    random variables, and study its statistical properties. We consider
    applications to topic modeling and derive a variational inference algorithm for
    approximate posterior inference.

  66. Comparing Probabilistic Models for Melodic Sequences.

    Authors: Amos Storkey, Athina Spiliopoulou
    Subjects: Machine Learning
    Abstract

    Modelling the real world complexity of music is a challenge for machine
    learning. We address the task of modeling melodic sequences from the same music
    genre. We perform a comparative analysis of two probabilistic models; a
    Dirichlet Variable Length Markov Model (Dirichlet-VMM) and a Time Convolutional
    Restricted Boltzmann Machine (TC-RBM). We show that the TC-RBM learns
    descriptive music features, such as underlying chords and typical melody
    transitions and dynamics.

  67. Towards Optimal Learning of Chain Graphs.

    Authors: Jose M. Pe&#xf1;a
    Subjects: Machine Learning
    Abstract

    In this paper, we extend Meek's conjecture (Meek 1997) from directed and
    acyclic graphs to chain graphs, and prove that the extended conjecture is true.
    Specifically, we prove that if a chain graph H is an independence map of the
    independence model induced by another chain graph G, then (i) G can be
    transformed into H by a sequence of directed and undirected edge additions and
    feasible splits and mergings, and (ii) after each operation in the sequence H
    remains an independence map of the independence model induced by G.

  68. Tree Exploration for Bayesian RL Exploration.

    Authors: Christos Dimitrakakis
    Subjects: Machine Learning
    Abstract

    Research in reinforcement learning has produced algorithms for optimal
    decision making under uncertainty that fall within two main types. The first
    employs a Bayesian framework, where optimality improves with increased
    computational time. This is because the resulting planning task takes the form
    of a dynamic programming problem on a belief tree with an infinite number of
    states. The second type employs relatively simple algorithm which are shown to
    suffer small regret within a distribution-free framework.

  69. Mixtures of conditional Gaussian scale mixtures applied to multiscale image representations.

    Authors: Lucas Theis, Matthias Bethge, Reshad Hosseini
    Subjects: Machine Learning
    Abstract

    We present a probabilistic model for natural images which is based on
    Gaussian scale mixtures and a simple multiscale representation. In contrast to
    the dominant approach to modeling whole images focusing on Markov random
    fields, we formulate our model in terms of a directed graphical model. We show
    that it is able to generate images with interesting higher-order correlations
    when trained on natural images or samples from an occlusion based model.

  70. Modern hierarchical, agglomerative clustering algorithms.

    Authors: Daniel M&#xfc;llner
    Subjects: Machine Learning
    Abstract

    This paper presents algorithms for hierarchical, agglomerative clustering
    which perform most efficiently in the general-purpose setup that is given in
    modern standard software. Requirements are: (1) the input data is given by
    pairwise dissimilarities between data points, but extensions to vector data are
    also discussed (2) the output is a "stepwise dendrogram", a data structure
    which is shared by all implementations in current standard software.

  71. Non-parametric Link Prediction.

    Authors: Michael Jordan, Purnamrita Sarkar, Deepayan Chakrabarti
    Subjects: Machine Learning
    Abstract

    We propose a non-parametric link prediction algorithm for a sequence of graph
    snapshots over time. The model predicts links based on the features of its
    endpoints, as well as those of the local neighborhood around the endpoints.
    This allows for different types of neighborhoods in a graph, each with its own
    dynamics (e.g, growing or shrinking communities). We prove the consistency of
    our estimator, and give a fast implementation based on locality-sensitive
    hashing.

  72. Variable Selection in High Dimensions with Random Designs and Orthogonal Matching Pursuit.

    Authors: Antony Joseph
    Subjects: Machine Learning
    Abstract

    The performance of Orthogonal Matching Pursuit (OMP) for variable selection
    is analyzed for random designs. When contrasted with the deterministic case,
    since the performance is here measured after averaging over the distribution of
    the design matrix, one can have far less stringent sparsity constraints on the
    coefficient vector. We demonstrate that for exact sparse vectors, the
    performance of the OMP is similar to known results on the Lasso algorithm
    [\textit{IEEE Trans. Inform. Theory} \textbf{55} (2009) 2183--2202].

  73. Dimension Reduction Using Rule Ensemble Machine Learning Methods: A Numerical Study of Three Ensemble Methods.

    Authors: Orianna DeMasi, Juan Meza, David H. Bailey
    Subjects: Machine Learning
    Abstract

    Ensemble methods for supervised machine learning have become popular due to
    their ability to accurately predict class labels with groups of simple,
    lightweight "base learners." While ensembles offer computationally efficient
    models that have good predictive capability they tend to be large and offer
    little insight into the patterns or structure in a dataset. We consider an
    ensemble technique that returns a model of ranked rules. The model accurately
    predicts class labels and has the advantage of indicating which parameter
    constraints are most useful for predicting those labels.

  74. Prediction of peptide bonding affinity: kernel methods for nonlinear modeling.

    Authors: Charles Bergeron, Theresa Hepburn, C. Matthew Sundling, Michael Krein, Bill Katt, Nagamani Sukumar, Curt M. Breneman, Kristin P. Bennett
    Subjects: Machine Learning
    Abstract

    This paper presents regression models obtained from a process of blind
    prediction of peptide binding affinity from provided descriptors for several
    distinct datasets as part of the 2006 Comparative Evaluation of Prediction
    Algorithms (COEPRA) contest. This paper finds that kernel partial least
    squares, a nonlinear partial least squares (PLS) algorithm, outperforms PLS,
    and that the incorporation of transferable atom equivalent features improves
    predictive capability.

  75. Semi-supervised logistic discrimination via labeled data and unlabeled data from different sampling distributions.

    Authors: Shuichi Kawano
    Subjects: Machine Learning
    Abstract

    This article addresses the problem of classification method based on both
    labeled and unlabeled data, where we assume that a density function for labeled
    data is different from that for unlabeled data. We propose a semi-supervised
    logistic regression model for classification problem along with the technique
    of covariate shift adaptation. Unknown parameters involved in proposed models
    are estimated by regularization with EM algorithm. A crucial issue in modeling
    process is the choices of tuning parameters in our semi-supervised logistic
    models.

  76. A General Theory of Concave Regularization for High Dimensional Sparse Estimation Problems.

    Authors: Tong Zhang, Cun-Hui Zhang
    Subjects: Machine Learning
    Abstract

    Concave regularization methods provide natural procedures for sparse
    recovery. However, they are difficult to analyze in the high dimensional
    setting. Only recently a few sparse recovery results have been established for
    some specific local solutions obtained via specialized numerical procedures.
    Still, the fundamental relationship between these solutions such as whether
    they are identical or their relationship to the global minimizer of the
    underlying nonconvex formulation is unknown.

  77. Using Supervised Learning to Improve Monte Carlo Integral Estimation.

    Authors: Brendan Tracey, David Wolpert, Juan J. Alonso
    Subjects: Machine Learning
    Abstract

    Monte Carlo (MC) techniques are often used to estimate integrals of a
    multivariate function using randomly generated samples of the function. In
    light of the increasing interest in uncertainty quantification and robust
    design applications in aerospace engineering, the calculation of expected
    values of such functions (e.g. performance measures) becomes important.
    However, MC techniques often suffer from high variance and slow convergence as
    the number of samples increases.

  78. Sparse Estimation using Bayesian Hierarchical Prior Modeling for Real and Complex Models.

    Authors: Niels Lovmand Pedersen, Dmitriy Shutin, Carles Navarro Manch&#xf3;n, Bernard Henri Fleury
    Subjects: Machine Learning
    Abstract

    Sparse modeling and estimation of complex signals is not uncommon in
    practice. However, historically, much attention has been drawn to real-valued
    system models, lacking the research of sparse signal modeling and estimation
    for complex-valued models. This paper introduces a unifying sparse Bayesian
    formalism that generalizes to complex- as well as real-valued systems. The
    methodology relies on hierarchical Bayesian sparsity-inducing prior modeling of
    the parameter of interest.

  79. Simulation-based optimal Bayesian experimental design for nonlinear systems.

    Authors: Xun Huan, Youssef M. Marzouk
    Subjects: Machine Learning
    Abstract

    The optimal selection of experimental conditions is essential to maximizing
    the value of data for inference and prediction, particularly in situations
    where experiments are time-consuming and expensive to conduct. We propose a
    general mathematical framework and an algorithmic approach for optimal
    experimental design with nonlinear simulation-based models; in particular, we
    focus on finding sets of experiments that provide the most information about
    targeted sets of parameters.

  80. Exact covariance thresholding into connected components for large-scale Graphical Lasso.

    Authors: Trevor Hastie, Rahul Mazumder
    Subjects: Machine Learning
    Abstract

    We consider the sparse inverse covariance regularization problem or graphical
    lasso with regularization parameter $\rho$. Suppose the co- variance graph
    formed by thresholding the entries of the sample covariance matrix at $\rho$ is
    decomposed into connected components. We show that the vertex-partition induced
    by the thresholded covariance graph is exactly equal to that induced by the
    estimated concentration graph. This simple rule, when used as a wrapper around
    existing algorithms, leads to enormous performance gains.

  81. Overlapping Mixtures of Gaussian Processes for the Data Association Problem.

    Authors: Miguel L&#xe1;zaro-Gredilla, Steven Van Vaerenbergh, Neil Lawrence
    Subjects: Machine Learning
    Abstract

    In this work we introduce a mixture of GPs to address the data association
    problem, i.e. to label a group of observations according to the sources that
    generated them. Unlike several previously proposed GP mixtures, the novel
    mixture has the distinct characteristic of using no gating function to
    determine the association of samples and mixture components. Instead, all the
    GPs in the mixture are global and samples are clustered following
    "trajectories" across input space.

  82. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition.

    Authors: Gianluca Bontempi, Souhaib Ben Taieb, Amir Atiya, Antti Sorjamaa
    Subjects: Machine Learning
    Abstract

    Multi-step ahead forecasting is still an open challenge in time series
    forecasting. Several approaches that deal with this complex problem have been
    proposed in the literature but an extensive comparison on a large number of
    tasks is still missing. This paper aims to fill this gap by reviewing existing
    strategies for multi-step ahead forecasting and comparing them in theoretical
    and practical terms.

  83. A consistent dot product embedding for stochastic blockmodel graphs.

    Authors: Carey E. Priebe, Daniel L. Sussman, Minh Tang, Donniell E. Fishkind
    Subjects: Machine Learning
    Abstract

    We present a method to estimate block membership of nodes in a random graph
    generated by a stochastic blockmodel. We use an embedding procedure motivated
    by the random dot product graph model, a particular example of the latent
    position model. The embedded vectors are clustered through minimization of a
    mean square error/criteria. We prove that this method is consistent for
    assigning nodes to blocks, as only a negligible number of nodes will be
    mis-assigned. We prove consistency of the method for directed and undirected
    graphs.

  84. Algebraic Geometric Comparison of Probability Distributions.

    Authors: Klaus-Robert Mueller, Franz J. Kiraly, Paul von Buenau, Frank C. Meinecke, Duncan A. J. Blythe
    Subjects: Machine Learning
    Abstract

    We propose a novel algebraic framework for treating probability distributions
    represented by their cumulants such as the mean and covariance matrix. As an
    example, we consider the unsupervised learning problem of finding the subspace
    on which several probability distributions agree. Instead of minimizing an
    objective function involving the estimated cumulants, we show that by treating
    the cumulants as elements of the polynomial ring we can directly solve the
    problem, at a lower computational cost and with higher accuracy.

  85. On the Evaluation Criterions for the Active Learning Processes.

    Authors: Vladimir Nikulin
    Subjects: Machine Learning
    Abstract

    In many data mining applications collection of sufficiently large datasets is
    the most time consuming and expensive. On the other hand, industrial methods of
    data collection create huge databases, and make difficult direct applications
    of the advanced machine learning algorithms. To address the above problems, we
    consider active learning (AL), which may be very efficient either for the
    experimental design or for the data filtering.

  86. Robustness of Anytime Bandit Policies.

    Authors: Jean-Yves Audibert, Antoine Salomon
    Subjects: Machine Learning
    Abstract

    This paper studies the deviations of the regret in a stochastic multi-armed
    bandit problem. When the total number of plays n is known beforehand by the
    agent, Audibert et al. (2009) exhibit a policy such that with probability at
    least 1-1/n, the regret of the policy is of order log(n). They have also shown
    that such a property is not shared by the popular ucb1 policy of Auer et al.
    (2002). This work first answers an open question: it extends this negative
    result to any anytime policy.

  87. Multi-Task Output Space Regularization.

    Authors: Bela A. Frigyik, Maya R. Gupta, Sergey Feldman, Luca Cazzanti, Peter Sadowski
    Subjects: Machine Learning
    Abstract

    We investigate multi-task learning from an output space regularization
    perspective. Most multi-task approaches tie together related tasks by
    constraining them to share input spaces and function classes. In contrast to
    this, we propose a multi-task paradigm which we call output space
    regularization, in which the only constraint is that the output spaces of the
    multiple tasks are related. We focus on a specific instance of output space
    regularization, multi-task averaging, that is both widely applicable and
    amenable to analysis.

  88. Spectral approximations in machine learning.

    Authors: Daniel J. McDonald, Darren Homrighausen
    Subjects: Machine Learning
    Abstract

    In many areas of machine learning, it becomes necessary to find the
    eigenvector decompositions of large matrices. We discuss two methods for
    reducing the computational burden of spectral decompositions: the more
    venerable Nystom extension and a newly introduced algorithm based on random
    projections. Previous work has centered on the ability to reconstruct the
    original matrix. We argue that a more interesting and relevant comparison is
    their relative performance in clustering and classification tasks using the
    approximate eigenvectors as features.

  89. Finding Non-overlapping Clusters for Generalized Inference Over Graphical Models.

    Authors: Divyanshu Vats, Jos&#xe9; M. F. Moura
    Subjects: Machine Learning
    Abstract

    Graphical models compactly capture stochastic dependencies amongst a
    collection of random variables using a graph. Inference over graphical models
    corresponds to finding marginal probability distributions given joint
    probability distributions. Several inference algorithms rely on iterative
    message passing between nodes. Although these algorithms can be generalized so
    that the message passing occurs between clusters of nodes, there are limited
    frameworks for finding such clusters. Moreover, current frameworks rely on
    finding clusters that are overlapping.

  90. Edit wars in Wikipedia.

    Authors: R&#xf3;bert Sumi, Taha Yasseri, Andr&#xe1;s Rung, Andr&#xe1;s Kornai, J&#xe1;nos Kert&#xe9;sz
    Subjects: Machine Learning
    Abstract

    We present a new, efficient method for automatically detecting severe
    conflicts `edit wars' in Wikipedia and evaluate this method on six different
    language WPs. We discuss how the number of edits, reverts, the length of
    discussions, the burstiness of edits and reverts deviate in such pages from
    those following the general workflow, and argue that earlier work has
    significantly over-estimated the contentiousness of the Wikipedia editing
    process.

  91. Unsupervised K-Nearest Neighbor Regression.

    Authors: Oliver Kramer
    Subjects: Machine Learning
    Abstract

    In many scientific disciplines structures in high-dimensional data have to be
    found, e.g., in stellar spectra, in genome data, or for face recognition tasks.
    In this work we present a novel approach to non-linear dimensionality
    reduction. It is based on fitting K-nearest neighbor regression to the
    unsupervised regression framework for learning of low-dimensional manifolds.
    Similar to related approaches that are mostly based on kernel methods,
    unsupervised K-nearest neighbor (UKNN) regression optimizes latent variables
    w.r.t.

  92. Robust Kernel Density Estimation.

    Authors: JooSeuk Kim, Clayton D. Scott
    Subjects: Machine Learning
    Abstract

    We propose a method for nonparametric density estimation that exhibits
    robustness to contamination of the training sample. This method achieves
    robustness by combining a traditional kernel density estimator (KDE) with ideas
    from classical M-estimation. We interpret the KDE based on a radial, positive
    semi-definite kernel as a sample mean in the associated reproducing kernel
    Hilbert space. Since the sample mean is sensitive to outliers, we estimate it
    robustly via M-estimation, yielding a robust kernel density estimator (RKDE).

  93. Linear Latent Force Models using Gaussian Processes.

    Authors: Mauricio A. &#xc1;lvarez, Neil D. Lawrence, David Luengo
    Subjects: Machine Learning
    Abstract

    Purely data driven approaches for machine learning present difficulties when
    data is scarce relative to the complexity of the model or when the model is
    forced to extrapolate. On the other hand, purely mechanistic approaches need to
    identify and specify all the interactions in the problem at hand (which may not
    be feasible) and still leave the issue of how to parameterize the system. In
    this paper, we present a hybrid approach using Gaussian processes and
    differential equations to combine data driven modelling with a physical model
    of the system.

  94. Statistical Topic Models for Multi-Label Document Classification.

    Authors: Timothy N. Rubin, America Chambers, Padhraic Smyth, Mark Steyvers
    Subjects: Machine Learning
    Abstract

    Machine learning approaches to multi-label document classification have (to
    date) largely relied on discriminative modeling techniques such as support
    vector machines. A drawback of these approaches is that performance rapidly
    drops off as the total number of labels and the number of labels per document
    increase.

  95. Loss-sensitive Training of Probabilistic Conditional Random Fields.

    Authors: Hugo Larochelle, Richard S. Zemel, Maksims N. Volkovs
    Subjects: Machine Learning
    Abstract

    We consider the problem of training probabilistic conditional random fields
    (CRFs) in the context of a task where performance is measured using a specific
    loss function. While maximum likelihood is the most common approach to training
    CRFs, it ignores the inherent structure of the task's loss function.

  96. High-Dimensional Structure Estimation in Ising Models: Tractable Graph Families.

    Authors: Animashree Anandkumar, Alan S. Willsky, Vincent Y.F. Tan
    Subjects: Machine Learning
    Abstract

    We consider the problem of high-dimensional Ising (graphical) model
    selection. We propose a simple algorithm for structure estimation based on the
    thresholding of the empirical conditional mutual information quantities. This
    algorithm requires only low-order statistics of the data and has a sample
    complexity of n =omega(J_{min}^{-4} log p), where p is the number of variables
    and $J_{\min}$ is the minimum (absolute) edge potential in the model. We also
    establish necessary conditions for structure estimation.

  97. Kernels for Vector-Valued Functions: a Review.

    Authors: Neil D. Lawrence, Lorenzo Rosasco, Mauricio A. Alvarez
    Subjects: Machine Learning
    Abstract

    Kernel methods are among the most popular techniques in machine learning.
    From a frequentist/discriminative perspective they play a central role in
    regularization theory as they provide a natural choice for the hypotheses space
    and the regularization functional through the notion of reproducing kernel
    Hilbert spaces. From a Bayesian/generative perspective they are the key in the
    context of Gaussian processes, where the kernel function is also known as the
    covariance function.

  98. Dynamic Large Spatial Covariance Matrix Estimation in Application to Semiparametric Model Construction via Variable Clustering: the SCE approach.

    Authors: Song Song
    Subjects: Machine Learning
    Abstract

    To better understand the spatial structure of large panels of economic and
    financial time series and provide a guideline for constructing semiparametric
    models, this paper first considers estimating a large spatial covariance matrix
    of the generalized $m$-dependent and $\beta$-mixing time series (with $J$
    variables and $T$ observations) by hard thresholding regularization as long as
    ${{\log J \, \cx^*(\ct)}}/{T} = \Co(1)$ (the former scheme with some time
    dependence measure $\cx^*(\ct)$) or $\log J /{T} = \Co(1)$ (the latter scheme
    with some upper bounded mixing coefficient).

  99. Relative Density-Ratio Estimation for Robust Distribution Comparison.

    Authors: Taiji Suzuki, Masashi Sugiyama, Takafumi Kanamori, Makoto Yamada, Hirotaka Hachiya
    Subjects: Machine Learning
    Abstract

    Divergence estimators based on direct approximation of density-ratios without
    going through separate approximation of numerator and denominator densities
    have been successfully applied to machine learning tasks that involve
    distribution comparison such as outlier detection, transfer learning, and
    two-sample homogeneity test. However, since density-ratio functions often
    possess high fluctuation, divergence estimation is still a challenging task in
    practice.

  100. Natural Evolution Strategies.

    Authors: Yi Sun, Daan Wierstra, Tom Schaul, Tobias Glasmachers, J&#xfc;rgen Schmidhuber
    Subjects: Machine Learning
    Abstract

    This paper presents Natural Evolution Strategies (NES), a recent family of
    algorithms that constitute a more principled approach to black-box optimization
    than established evolutionary algorithms. NES maintains a parameterized
    distribution on the set of solution candidates, and the natural gradient is
    used to update the distribution's parameters in the direction of higher
    expected fitness. We introduce a collection of techniques that address issues
    of convergence, robustness, sample complexity, computational complexity and
    sensitivity to hyperparameters.

  101. Gaussian Process Regression with a Student-t Likelihood.

    Authors: Pasi Jyl&#xe4;nki, Jarno Vanhatalo, Aki Vehtari
    Subjects: Machine Learning
    Abstract

    This paper considers the robust and efficient implementation of Gaussian
    process regression with a Student-t observation model. The challenge with the
    Student-t model is the analytically intractable inference which is why several
    approximative methods have been proposed. The expectation propagation (EP) has
    been found to be a very accurate method in many empirical studies but the
    convergence of the EP is known to be problematic with models containing
    non-log-concave site functions such as the Student-t distribution.

  102. Tight Measurement Bounds for Exact Recovery of Structured Sparse Signals.

    Authors: Robert Nowak, Benjamin Recht, Nikhil Rao
    Subjects: Machine Learning
    Abstract

    Standard compressive sensing results state that to exactly recover an s
    sparse signal in R^p, one requires O(s\cdotlog p) measurements. While this
    bound is extremely useful in practice, often real world signals are not only
    sparse, but also exhibit structure in the sparsity pattern. We focus on
    group-structured patterns in this paper. Under this model, groups of signal
    coefficients are active (or inactive) together. The groups are predefined, but
    the particular set of groups that are active (i.e., in the signal support) must
    be learned from measurements.

  103. Residual Component Analysis.

    Authors: Neil D. Lawrence, Alfredo A. Kalaitzis
    Subjects: Machine Learning
    Abstract

    Probabilistic principal component analysis (PPCA) seeks a low dimensional
    representation of a data set in the presence of independent spherical Gaussian
    noise, Sigma = (sigma^2)*I. The maximum likelihood solution for the model is an
    eigenvalue problem on the sample covariance matrix. In this paper we consider
    the situation where the data variance is already partially explained by other
    factors, e.g. covariates of interest, or temporal correlations leaving some
    residual variance.

  104. Online algorithms for Nonnegative Matrix Factorization with the Itakura-Saito divergence.

    Authors: Francis Bach, Augustin Lef&#xe8;vre, C&#xe9;dric F&#xe9;votte
    Subjects: Machine Learning
    Abstract

    Nonnegative matrix factorization (NMF) is now a common tool for audio source
    separation. When learning NMF on large audio databases, one major drawback is
    that the complexity in time is O(FKN) when updating the dictionary (where (F;N)
    is the dimension of the input power spectrograms, and K the number of basis
    spectra), thus forbidding its application on signals longer than an hour. We
    provide an online algorithm with a complexity of O(FK) in time and memory for
    updates in the dictionary.

  105. A Tutorial on Bayesian Nonparametric Models.

    Authors: David M. Blei, Samuel J. Gershman
    Subjects: Machine Learning
    Abstract

    A key problem in statistical modeling is model selection, how to choose a
    model at an appropriate level of complexity. This problem appears in many
    settings, most prominently in choosing the number ofclusters in mixture models
    or the number of factors in factor analysis. In this tutorial we describe
    Bayesian nonparametric methods, a class of methods that side-steps this issue
    by allowing the data to determine the complexity of the model. This tutorial is
    a high-level introduction to Bayesian nonparametric methods and contains
    several examples of their application.

  106. Pitman-Yor Diffusion Trees.

    Authors: Zoubin Ghahramani, David A. Knowles
    Subjects: Machine Learning
    Abstract

    We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical
    clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which
    removes the restriction to binary branching structure. The generative process
    is described and shown to result in an exchangeable distribution over data
    points. We prove some theoretical properties of the model and then present two
    inference methods: a collapsed MCMC sampler which allows us to model
    uncertainty over tree structures, and a computationally efficient greedy
    Bayesian EM search algorithm.

  107. Source Separation and Clustering of Phase-Locked Subspaces: Derivations and Proofs.

    Authors: Miguel Almeida, Jan-Hendrik Schleimer, Jos&#xe9; Bioucas-Dias, Ricardo Vig&#xe1;rio
    Subjects: Machine Learning
    Abstract

    Due to space limitations, our submission "Source Separation and Clustering of
    Phase-Locked Subspaces", accepted for publication on the IEEE Transactions on
    Neural Networks in 2011, presented some results without proof. Those proofs are
    provided in this paper.

  108. Fast, Linear Time Hierarchical Clustering using the Baire Metric.

    Authors: Fionn Murtagh, Pedro Contreras
    Subjects: Machine Learning
    Abstract

    The Baire metric induces an ultrametric on a dataset and is of linear
    computational complexity, contrasted with the standard quadratic time
    agglomerative hierarchical clustering algorithm. In this work we evaluate
    empirically this new approach to hierarchical clustering. We compare
    hierarchical clustering based on the Baire metric with (i) agglomerative
    hierarchical clustering, in terms of algorithm properties; (ii) generalized
    ultrametrics, in terms of definition; and (iii) fast clustering through k-means
    partititioning, in terms of quality of results.

  109. Ranking via Sinkhorn Propagation.

    Authors: Ryan Prescott Adams, Richard S. Zemel
    Subjects: Machine Learning
    Abstract

    It is of increasing importance to develop learning methods for ranking. In
    contrast to many learning objectives, however, the ranking problem presents
    difficulties due to the fact that the space of permutations is not smooth. In
    this paper, we examine the class of rank-linear objective functions, which
    includes popular metrics such as precision and discounted cumulative gain. In
    particular, we observe that expectations of these gains are completely
    characterized by the marginals of the corresponding distribution over
    permutation matrices.

  110. Moment based estimation of stochastic Kronecker graph parameters.

    Authors: Art B. Owen, David F. Gleich
    Subjects: Machine Learning
    Abstract

    Stochastic Kronecker graphs supply a parsimonious model for large sparse real
    world graphs. They can specify the distribution of a large random graph using
    only three or four parameters. Those parameters have however proved difficult
    to choose in specific applications. This article looks at method of moments
    estimators that are computationally much simpler than maximum likelihood. The
    estimators are fast and in our examples, they typically yield Kronecker
    parameters with expected feature counts closer to a given graph than we get
    from KronFit.

  111. Optimal Reinforcement Learning for Gaussian Systems.

    Authors: Philipp Hennig
    Subjects: Machine Learning
    Abstract

    The exploration-exploitation tradeoff is among the central challenges of
    reinforcement learning. A hypothetical exact Bayesian learner would provide the
    optimal solution, but is intractable in general. I show that, however, in the
    specific case of Gaussian process inference, it is possible to make analytic
    statements about optimal learning of both rewards and transition dynamics, for
    nonlinear, time-varying systems in continuous time and space, subject to a
    relatively weak restriction on the dynamics. The solution is described by an
    infinite-dimensional differential equation.

  112. Hashing Algorithms for Large-Scale Learning.

    Authors: Ping Li, Arnd Christian Konig, Joshua Moore, Anshumali Shrivastava
    Subjects: Machine Learning
    Abstract

    In this paper, we first demonstrate that b-bit minwise hashing, whose
    estimators are positive definite kernels, can be naturally integrated with
    learning algorithms such as SVM and logistic regression. We adopt a simple
    scheme to transform the nonlinear (resemblance) kernel into linear (inner
    product) kernel; and hence large-scale problems can be solved extremely
    efficiently. Our method provides a simple effective solution to large-scale
    learning in massive and extremely high-dimensional datasets, especially when
    data do not fit in memory.

  113. Causal Network Inference via Group Sparse Regularization.

    Authors: Robert Nowak, Andrew Bolstad, Barry Van Veen
    Subjects: Machine Learning
    Abstract

    This paper addresses the problem of inferring sparse causal networks modeled
    by multivariate auto-regressive (MAR) processes. Conditions are derived under
    which the Group Lasso (gLasso) procedure consistently estimates sparse network
    structure. The key condition involves a "false connection score." In
    particular, we show that consistent recovery is possible even when the number
    of observations of the network is far less than the number of parameters
    describing the network, provided that the false connection score is less than
    one.

  114. Multi-stage Convex Relaxation for Feature Selection.

    Authors: Multi-stage Convex Relaxation for Feature Selection
    Subjects: Machine Learning
    Abstract

    A number of recent work studied the effectiveness of feature selection using
    Lasso. It is known that under the restricted isometry properties (RIP), Lasso
    does not generally lead to the exact recovery of the set of nonzero
    coefficients, due to the looseness of convex relaxation. This paper considers
    the feature selection property of nonconvex regularization, where the solution
    is given by a multi-stage convex relaxation scheme.

  115. Multidimensional Scaling in the Poincare Disk.

    Authors: Andrej Cvetkovski, Mark Crovella
    Subjects: Machine Learning
    Abstract

    Multidimensional scaling (MDS) is a class of projective algorithms
    traditionally used to produce two- or three-dimensional visualizations of
    datasets consisting of multidimensional objects or interobject distances.
    Recently, metric MDS has been applied to the problems of graph embedding for
    the purpose of approximate encoding of edge or path costs using node
    coordinates in metric space. Several authors have also pointed out that for
    data with an inherent hierarchical structure, hyperbolic target space may be a
    more suitable choice for accurate embedding than Euclidean space.

  116. Adaptive and Optimal Online Linear Regression on L1-balls.

    Authors: S&#xe9;bastien Gerchinovitz, Jia Yuan Yu
    Subjects: Machine Learning
    Abstract

    We consider the problem of online linear regression on individual sequences.
    The goal in this paper is for the forecaster to output sequential predictions
    which are, after T time rounds, almost as good as the ones output by the best
    linear predictor in a given L1-ball in R^d. We consider both the cases where
    the dimension d is small and large relative to the time horizon T. We first
    present regret bounds with optimal dependencies on the sizes U, X and Y of the
    L1-ball, the input data and the observations.

  117. Independent screening for single-index hazard rate models with ultra-high dimensional features.

    Authors: Anders Gorst-Rasmussen, Thomas H. Scheike
    Subjects: Machine Learning
    Abstract

    In data sets with many more features than observations, independent screening
    based on all univariate regression models leads to a computationally convenient
    variable selection method. Recent efforts have shown that in the case of
    generalized linear models, independent screening may suffice to capture all
    relevant features with high probability, even in ultra-high dimension. It is
    unclear whether this formal sure screening property is attainable when the
    response is a right-censored survival time.

  118. Bounds on the Maximum Bayes Error Given Moments.

    Authors: Bela A. Frigyik, Maya R. Gupta
    Subjects: Machine Learning
    Abstract

    We show how to compute lower bounds for the maximum possible Bayes error if
    the class-conditional distributions must satisfy moment constraints. Our
    approach makes use of Curto and Fialkow's solutions for the truncated moment
    problem. We also give an upper bound that follows from related work by
    Lanckreit et al., Bertsimas and Popescu, and Marshall and Olkin.

  119. Closed-Form EM for Sparse Coding and Its Application to Source Separation.

    Authors: J&#xf6;rg L&#xfc;cke, Abdul-Saboor Sheikh
    Subjects: Machine Learning
    Abstract

    We define and discuss the first sparse coding algorithm based on closed-form
    EM updates and continuous latent variables. The underlying generative model
    consists of a flexibly parameterized `spike-and-slab' prior and a standard
    Gaussian noise model. Closed-form solutions for E- and M-step equations are
    derived by generalizing probabilistic PCA. The resulting EM algorithm infers
    all model parameters including the level of sparsity, and it takes all modes of
    a potentially multi-modal posterior into account.

  120. Feedback Message Passing for Inference in Gaussian Graphical Models.

    Authors: Animashree Anandkumar, Alan S. Willsky, Venkat Chandrasekaran, Ying Liu
    Subjects: Machine Learning
    Abstract

    While loopy belief propagation (LBP) performs reasonably well for inference
    in some Gaussian graphical models with cycles, its performance is
    unsatisfactory for many others. In particular for some models LBP does not
    converge, and in general when it does converge, the computed variances are
    incorrect (except for cycle-free graphs for which belief propagation (BP) is
    non-iterative and exact).

  121. Order-preserving factor analysis (OPFA).

    Authors: Alfred O. Hero III, Arnau Tibau Puig
    Subjects: Machine Learning
    Abstract

    We present a novel factor analysis method that can be applied to the
    discovery of common factors shared among trajectories in multivariate time
    series data. These factors satisfy a precedence-ordering property: certain
    factors are recruited only after some other factors are activated.
    Precedence-ordering arise in applications where variables are activated in a
    specific order, which is unknown. The proposed method is based on a linear
    model that accounts for each factor's inherent delays and relative order.

  122. Interpreting Graph Cuts as a Max-Product Algorithm.

    Authors: Daniel Tarlow, Inmar E. Givoni, Richard S. Zemel, Brendan J. Frey
    Subjects: Machine Learning
    Abstract

    The maximum a posteriori (MAP) configuration of binary variable models with
    submodular graph-structured energy functions can be found efficiently and
    exactly by graph cuts. Max-product belief propagation (MP) has been shown to be
    suboptimal on this class of energy functions by a canonical counterexample
    where MP converges to a suboptimal fixed point (Kulesza & Pereira, 2008).

  123. Pruning nearest neighbor cluster trees.

    Authors: Ulrike von Luxburg, Samory Kpotufe
    Subjects: Machine Learning
    Abstract

    Nearest neighbor (k-NN) graphs are widely used in machine learning and data
    mining applications, and our aim is to better understand what they reveal about
    the cluster structure of the unknown underlying distribution of points.
    Moreover, is it possible to identify spurious structures that might arise due
    to sampling variability?

  124. Multi-scale Mining of fMRI data with Hierarchical Structured Sparsity.

    Authors: Francis Bach, Rodolphe Jenatton, Guillaume Obozinski, Bertrand Thirion, Alexandre Gramfort, Vincent Michel, Evelyn Eger
    Subjects: Machine Learning
    Abstract

    Inverse inference, or "brain reading", is a recent paradigm for analyzing
    functional magnetic resonance imaging (fMRI) data, based on pattern recognition
    and statistical learning. By predicting some cognitive variables related to
    brain activation maps, this approach aims at decoding brain activity. Inverse
    inference takes into account the multivariate information between voxels and is
    currently the only way to assess how precisely some cognitive information is
    encoded by the activity of neural populations within the whole brain.

  125. SERAPH: Semi-supervised Metric Learning Paradigm with Hyper Sparsity.

    Authors: Masashi Sugiyama, Makoto Yamada, Gang Niu, Bo Dai
    Subjects: Machine Learning
    Abstract

    We consider the problem of learning a distance metric from a limited amount
    of pairwise information as effectively as possible. The proposed SERAPH
    (SEmi-supervised metRic leArning Paradigm with Hyper sparsity) is a direct and
    substantially more natural approach for semi-supervised metric learning, since
    the supervised and unsupervised parts are based on a unified information
    theoretic framework.

  126. Iterative Reweighted Algorithms for Sparse Signal Recovery with Temporally Correlated Source Vectors.

    Authors: Bhaskar D. Rao, Zhilin Zhang
    Subjects: Machine Learning
    Abstract

    Iterative reweighted algorithms, as a class of algorithms for sparse signal
    recovery, have been found to have better performance than their non-reweighted
    counterparts. However, for solving the problem of multiple measurement vectors
    (MMVs), all the existing reweighted algorithms do not account for temporal
    correlation among source vectors and thus their performance degrades
    significantly in the presence of correlation.

  127. Online Learning: Stochastic and Constrained Adversaries.

    Authors: Ambuj Tewari, Karthik Sridharan, Alexander Rakhlin
    Subjects: Machine Learning
    Abstract

    Learning theory has largely focused on two main learning scenarios. The first
    is the classical statistical setting where instances are drawn i.i.d. from a
    fixed distribution and the second scenario is the online learning, completely
    adversarial scenario where adversary at every time step picks the worst
    instance to provide the learner with. It can be argued that in the real world
    neither of these assumptions are reasonable. It is therefore important to study
    problems with a range of assumptions on data.

  128. Fast global convergence of gradient methods for high-dimensional statistical recovery.

    Authors: Martin J. Wainwright, Alekh Agarwal, Sahand N. Negahban
    Subjects: Machine Learning
    Abstract

    Many statistical M-estimators are based on convex optimization problems
    formed by the combination of a data-dependent loss function with a norm-based
    regularizer. We analyze the convergence rates of projected gradient methods for
    solving such problems, working within a high-dimensional framework that allows
    the data dimension d to grow with (and possibly exceed) the sample size n. This
    high-dimensional structure precludes the usual global assumptions---namely,
    strong convexity and smoothness conditions---that underlie much of classical
    optimization analysis.

  129. Compressive Network Analysis.

    Authors: Han Liu, Yuan Yao, Xiaoye Jiang, Leonidas Guibas
    Subjects: Machine Learning
    Abstract

    Modern data acquisition routinely produces massive amounts of network data.
    Though many methods and models have been proposed to analyze such data, the
    research of network data is largely disconnected with the classical theory of
    statistical learning and signal processing. In this paper, we present a new
    framework for modeling network data, which connects two seemingly different
    areas: network data analysis and compressed sensing. From a nonparametric
    perspective, we model an observed network using a large dictionary.

  130. Scaled Sparse Linear Regression.

    Authors: Cun-Hui Zhang, Tingni Sun
    Subjects: Machine Learning
    Abstract

    Scaled sparse linear regression jointly estimates the regression coefficients
    and noise level in a linear model. It chooses an equilibrium with a sparse
    regression method by iteratively estimating the noise level via the mean
    residual squares and scaling the penalty in proportion to the estimated noise
    level.

  131. Robust Clustering Using Outlier-Sparsity Regularization.

    Authors: Georgios B. Giannakis, Vassilis Kekatos, Pedro A. Forero
    Subjects: Machine Learning
    Abstract

    Notwithstanding the popularity of conventional clustering algorithms such as
    K-means and probabilistic clustering, their clustering results are sensitive to
    the presence of outliers in the data. Even a few outliers can compromise the
    ability of these algorithms to identify meaningful hidden structures rendering
    their outcome unreliable. This paper develops robust clustering algorithms that
    not only aim to cluster the data, but also to identify the outliers.

  132. On the Fourier transform of exponentiated Euclidean distance functions: Properties and applications to gradient density estimation.

    Authors: Karthik S. Gurumoorthy, Anand Rangarajan, Arunava Banerjee
    Subjects: Machine Learning
    Abstract

    The complex wave representation (CWR) converts unsigned 2D distance
    transforms into their corresponding wave functions when the distance transform
    S appears as the phase of the wave function \phi specifically, \phi = \exp
    (\frac{iS}{\tau}). In this work we prove a novel result wherein we show
    convergence of the normalized power spectrum (square magnitude of the Fourier
    transform) of the wave function to the density function of the distance
    transform gradients as the free parameter \tau --> 0--in colloquial terms,
    spatial frequencies are gradient histogram bins.

  133. Unified Treatment of Hidden Markov Switching Models.

    Authors: Silvia Chiappa
    Subjects: Machine Learning
    Abstract

    Many real-world problems encountered in several disciplines deal with the
    modeling of time-series containing different underlying dynamical regimes, for
    which probabilistic approaches are very often employed. In this paper we
    describe several such approaches in the common framework of graphical models.
    We give a unified overview of models previously introduced in the literature,
    which is simpler and more comprehensive than previous descriptions and enables
    us to highlight commonalities and differences among models that were not
    observed in the past.

  134. Planar Cycle Covering Graphs.

    Authors: Julian Yarkony, Alexander T. Ihler, Charless C. Fowlkes
    Subjects: Machine Learning
    Abstract

    We describe a new variational lower-bound on the minimum energy configuration
    of a planar binary Markov Random Field (MRF). Our method is based on adding
    auxiliary nodes to every face of a planar embedding of the graph in order to
    capture the effect of unary potentials. A ground state of the resulting
    approximation can be computed efficiently by reduction to minimum-weight
    perfect matching. We show that optimization of variational parameters achieves
    the same lower-bound as dual-decomposition into the set of all cycles of the
    original graph.

  135. On Identifying Significant Edges in Graphical Models.

    Authors: Marco Scutari, Radhakrishnan Nagarajan
    Subjects: Machine Learning
    Abstract

    Graphical models, and in particular Bayesian networks, have been widely used
    to investigate data in the biological and healthcare domains. This can be
    attributed to the recent explosion of high-throughput data across these domains
    and the importance of understanding the causal relationships between the
    variables of interest. However, classic model validation techniques for
    identifying significant edges rely on the choice of an ad-hoc threshold, which
    is non-trivial and can have a pronounced impact on the conclusions of the
    analysis.

  136. Robust Nonparametric Regression via Sparsity Control with Application to Load Curve Data Cleansing.

    Authors: Georgios B. Giannakis, Gonzalo Mateos
    Subjects: Machine Learning
    Abstract

    Nonparametric methods are widely applicable to statistical inference
    problems, since they rely on a few modeling assumptions. In this context, the
    fresh look advocated here permeates benefits from variable selection and
    compressive sampling, to robustify nonparametric regression against outliers -
    that is, data markedly deviating from the postulated models. A variational
    counterpart to least-trimmed squares regression is shown closely related to an
    L0-(pseudo)norm-regularized estimator, that encourages sparsity in a vector
    explicitly modeling the outliers.

  137. Least-Squares Independence Regression for Non-Linear Causal Inference under Non-Gaussian Noise.

    Authors: Masashi Sugiyama, Makoto Yamada, Jun Sese
    Subjects: Machine Learning
    Abstract

    The discovery of non-linear causal relationship under additive non-Gaussian
    noise models has attracted considerable attention recently because of their
    high flexibility. In this paper, we propose a novel causal inference algorithm
    called least-squares independence regression (LSIR). LSIR learns the additive
    noise model through the minimization of an estimator of the squared-loss mutual
    information between inputs and residuals.

  138. Fast Learning Rate of lp-MKL and its Minimax Optimality.

    Authors: Taiji Suzuki
    Subjects: Machine Learning
    Abstract

    In this paper, we give a new sharp generalization bound of lp-MKL which is a
    generalized framework of multiple kernel learning (MKL) and imposes
    lp-mixed-norm regularization instead of l1-mixed-norm regularization. We
    utilize localization techniques to obtain the sharp learning rate. The bound is
    characterized by the decay rate of the eigenvalues of the associated kernels. A
    larger decay rate gives a faster convergence rate. Furthermore, we give the
    minimax learning rate on the ball characterized by lp-mixed-norm in the product
    space.

  139. Sharp Convergence Rate and Support Consistency of Multiple Kernel Learning with Sparse and Dense Regularization.

    Authors: Taiji Suzuki, Ryota Tomioka, Masashi Sugiyama
    Subjects: Machine Learning
    Abstract

    We theoretically investigate the convergence rate and support consistency
    (i.e., correctly identifying the subset of non-zero coefficients in the large
    sample limit) of multiple kernel learning (MKL). We focus on MKL with block-l1
    regularization (inducing sparse kernel combination), block-l2 regularization
    (inducing uniform kernel combination), and elastic-net regularization
    (including both block-l1 and block-l2 regularization).

  140. Sufficient Component Analysis for Supervised Dimension Reduction.

    Authors: Masashi Sugiyama, Makoto Yamada, Gang Niu, Jun Takagi
    Subjects: Machine Learning
    Abstract

    The purpose of sufficient dimension reduction (SDR) is to find the
    low-dimensional subspace of input features that is sufficient for predicting
    output values. In this paper, we propose a novel distribution-free SDR method
    called sufficient component analysis (SCA), which is computationally more
    efficient than existing methods.

  141. Theoretical Properties of the Overlapping Groups Lasso.

    Authors: Daniel Percival
    Subjects: Machine Learning
    Abstract

    We present two sets of theoretical results on the grouped lasso with overlap
    of Jacob, Obozinski and Vert (2009) in the linear regression setting. This
    method allows for joint selection of predictors in sparse regression, allowing
    for complex structured sparsity over the predictors encoded as a set of groups.
    This flexible framework suggests that arbitrarily complex structures can be
    encoded with an intricate set of groups. Our results show that this strategy
    results in unexpected theoretical consequences for the procedure.

  142. Additive Kernels for Gaussian Process Modeling.

    Authors: David Ginsbourger, Nicolas Durrande, Olivier Roustant
    Subjects: Machine Learning
    Abstract

    Gaussian Process (GP) models are often used as mathematical approximations of
    computationally expensive experiments. Provided that its kernel is suitably
    chosen and that enough data is available to obtain a reasonable fit of the
    simulator, a GP model can beneficially be used for tasks such as prediction,
    optimization, or Monte-Carlo-based quantification of uncertainty. However, the
    former conditions become unrealistic when using classical GPs as the dimension
    of input increases.

  143. A Frequency Domain EM Algorithm to Detect Similar Dynamics in Time Series with Applications to Spike Sorting and Macro-Economics.

    Authors: Georg M. Goerg
    Subjects: Machine Learning
    Abstract

    In this work I propose a frequency domain adaptation of the Expectation
    Maximization (EM) algorithm to separate a family of sequential observations in
    classes of similar dynamic structure, which can either mean non-stationary
    signals of similar shape, or stationary signals with similar auto-covariance
    function. It does this by viewing the magnitude of the discrete Fourier
    transform (DFT) of the signals (or power spectrum) as a probability
    density/mass function (pdf/pmf) on the unit circle: signals with similar
    dynamics have similar pdfs; distinct patterns have distinct pdfs.

  144. Constrained Mixture Models for Asset Returns Modelling.

    Authors: Iead Rezek
    Subjects: Machine Learning
    Abstract

    The estimation of asset return distributions is crucial for determining
    optimal trading strategies. In this paper we describe the constrained mixture
    model, based on a mixture of Gamma and Gaussian distributions, to provide an
    accurate description of price trends as being clearly positive, negative or
    ranging while accounting for heavy tails and high kurtosis. The model is
    estimated in the Expectation Maximisation framework and model order estimation
    also respects the model's constraints.

  145. Nonparametric Bayesian Density Estimation via the Kernel Trick.

    Authors: Ferenc Huszar, Simon Lacoste-Julien
    Subjects: Machine Learning
    Abstract

    Kernel methods have become increasingly popular in a variety of machine
    learning applications due to their flexibility and appealing algorithmic
    properties. In the context of Bayesian inference, the kernel trick can be used
    to construct nonparametric Bayesian methods directly from parametric models.
    For example, Gaussian process regression can be derived from kernelising
    Bayesian linear regression. Despite the success of the Gaussian process
    framework, the kernel trick is rarely explicitly considered in the Bayesian
    literature.

  146. Estimating $\beta$-mixing coefficients.

    Authors: Cosma Rohilla Shalizi, Daniel J. McDonald, Mark Schervish
    Subjects: Machine Learning
    Abstract

    The literature on statistical learning for time series assumes the asymptotic
    independence or ``mixing' of the data-generating process. These mixing
    assumptions are never tested, nor are there methods for estimating mixing rates
    from data. We give an estimator for the $\beta$-mixing rate based on a single
    stationary sample path and show it is $L_1$-risk consistent.

  147. Generalization error bounds for stationary autoregressive models.

    Authors: Cosma Rohilla Shalizi, Daniel J. McDonald, Mark Schervish
    Subjects: Machine Learning
    Abstract

    We derive generalization error bounds for stationary univariate
    autoregressive (AR) models. We show that the stationarity assumption alone lets
    us treat the estimation of AR models as a regularized kernel regression without
    the need to further regularize the model arbitrarily. We thereby bound the
    Rademacher complexity of AR models and apply existing Rademacher complexity
    results to characterize the predictive risk of AR models. We demonstrate our
    methods by predicting interest rate movements.

  148. Adapting to Non-stationarity with Growing Expert Ensembles.

    Authors: Cosma Rohilla Shalizi, Abigail Z. Jacobs, Aaron Clauset
    Subjects: Machine Learning
    Abstract

    When dealing with time series with complex and uncertain non-stationarities,
    low retrospective regret on individual realizations is in general a more
    appropriate goal than low prospective risk in expectation.

  149. Multiple Kernel Learning: A Unifying Probabilistic Viewpoint.

    Authors: Hannes Nickisch, Matthias Seeger
    Subjects: Machine Learning
    Abstract

    We present a probabilistic viewpoint to multiple kernel learning unifying
    well-known regularised risk approaches and recent advances in approximate
    Bayesian inference relaxations. The framework proposes a general objective
    function suitable for regression, robust regression and classification that is
    lower bound of the marginal likelihood and contains many regularised risk
    approaches as special cases. Furthermore, we derive an efficient and provably
    convergent optimisation algorithm.

  150. The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning.

    Authors: Gilles Blanchard, Marius Kloft
    Subjects: Machine Learning
    Abstract

    We derive an upper bound on the local Rademacher complexity of $\ell_p$-norm
    multiple kernel learning, which yields a tighter excess risk bound than global
    approaches. Previous local approaches aimed at analyzed the case $p=1$ only
    while our analysis covers all cases $1\leq p\leq\infty$, assuming the different
    feature mappings corresponding to the different kernels to be uncorrelated.

  151. Fast Convergence Rate of Multiple Kernel Learning with Elastic-net Regularization.

    Authors: Taiji Suzuki, Ryota Tomioka, Masashi Sugiyama
    Subjects: Machine Learning
    Abstract

    We investigate the learning rate of multiple kernel leaning (MKL) with
    elastic-net regularization, which consists of an $\ell_1$-regularizer for
    inducing the sparsity and an $\ell_2$-regularizer for controlling the
    smoothness. We focus on a sparse setting where the total number of kernels is
    large but the number of non-zero components of the ground truth is relatively
    small, and prove that elastic-net MKL achieves the minimax learning rate on the
    $\ell_2$-mixed-norm ball.

  152. Probabilistic analysis of the human transcriptome with side information.

    Authors: Leo Lahti
    Subjects: Machine Learning
    Abstract

    Understanding functional organization of genetic information is a major
    challenge in modern biology.

  153. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions.

    Authors: Martin J. Wainwright, Sahand Negahban, Alekh Agarwal
    Subjects: Machine Learning
    Abstract

    We analyze a class of estimators based on convex relaxation for solving
    high-dimensional matrix decomposition problems. The observations are the noisy
    realizations of the sum of an (appproximately) low rank matrix $\Theta^\star$
    with a second matrix $\Gamma^\star$ endowed with a complementary form of
    low-dimensional structure. We derive a general theorem that gives upper bounds
    on the Frobenius norm error for an estimate of the pair $(\Theta^\star,
    \Gamma^\star)$ obtained by solving a convex optimization problem that combines
    the nuclear norm with a general decomposable regularizer.

  154. Predictive Active Set Selection Methods for Gaussian Processes.

    Authors: Ricardo Henao, Ole Winther
    Subjects: Machine Learning
    Abstract

    We propose an active set selection framework for Gaussian process
    classification for cases when the dataset is large enough to render its
    inference prohibitive. Our scheme consists on a two step alternating procedure
    of active set update rules and hyperparameter optimization based upon marginal
    likelihood maximization. The active set update rules rely on the ability of the
    predictive distributions of a Gaussian process classifier to estimate the
    relative contribution of a datapoint when being either included or removed from
    the model.

  155. Joint and Individual Variation Explained (JIVE) for Integrated Analysis of Multiple Datatypes.

    Authors: J.S. Marron, Andrew B. Nobel, Eric F. Lock, Katherine A. Hoadley
    Subjects: Machine Learning
    Abstract

    Research in a number of fields now requires the analysis of "multi-block"
    data, in which multiple high-dimensional, and fundamentally disparate,
    datatypes are available for a common set of objects. In this paper we introduce
    Joint and Individual Variation Explained (JIVE), a general method for the
    integrated analysis of multi-block data. The method decomposes a multi-block
    dataset into a sum of three terms: a low-rank approximation capturing joint
    variation across datatypes, low-rank approximations for structured variation
    individual to each datatype, and residual noise.

  156. Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection.

    Authors: David Kempe, Abhimanyu Das
    Subjects: Machine Learning
    Abstract

    We study the problem of selecting a subset of $k$ random variables from a
    large set, in order to obtain the best linear prediction of another variable of
    interest. This problem can be viewed in the context of both feature selection
    and sparse approximation. We analyze the performance of widely used greedy
    heuristics, using insights from the maximization of submodular functions and
    spectral analysis. We introduce the \todef{submodularity ratio} as a key
    quantity to help understand why greedy algorithms perform well even when the
    variables are highly correlated.

  157. Sparse Signal Recovery with Temporally Correlated Source Vectors Using Sparse Bayesian Learning.

    Authors: Bhaskar D. Rao, Zhilin Zhang
    Subjects: Machine Learning
    Abstract

    We address the sparse signal recovery problem in the context of multiple
    measurement vectors (MMV) when elements in each nonzero row of the solution
    matrix are temporally correlated. Existing algorithms do not consider such
    temporal correlations and thus their performance degrades significantly with
    the correlations. In this work, we propose a block sparse Bayesian learning
    framework which models the temporal correlations.

  158. Robust Matrix Completion with Corrupted Columns.

    Authors: Huan Xu, Constantine Caramanis, Sujay Sanghavi, Yudong Chen
    Subjects: Machine Learning
    Abstract

    This paper considers the problem of matrix completion, when some number of
    the columns are arbitrarily corrupted, potentially by a malicious adversary. It
    is well-known that standard algorithms for matrix completion can return
    arbitrarily poor results, if even a single column is corrupted. What can be
    done if a large number, or even a constant fraction of columns are corrupted?
    In this paper, we study this very problem, and develop an efficient algorithm
    for its solution.

  159. How the result of graph clustering methods depends on the construction of the graph.

    Authors: Markus Maier, Matthias Hein, Ulrike von Luxburg
    Subjects: Machine Learning
    Abstract

    We study the scenario of graph-based clustering algorithms such as spectral
    clustering. Given a set of data points, one first has to construct a graph on
    the data points and then apply a graph clustering algorithm to find a suitable
    partition of the graph. Our main question is if and how the construction of the
    graph (choice of the graph, choice of parameters, choice of weights) influences
    the outcome of the final clustering result.

  160. An Introduction to Artificial Prediction Markets.

    Authors: Adrian Barbu, Nathan Lay
    Subjects: Machine Learning
    Abstract

    Prediction markets are used in real life to predict outcomes of interest such
    as presidential elections. This paper presents a mathematical theory of
    artificial prediction markets for supervised learning of conditional
    probability estimators. The artificial prediction market is a novel method for
    fusing the prediction information of features or trained classifiers, where the
    fusion result is the contract price on the possible outcomes.

  161. Semiparametric Latent Variable Models for Guided Representation.

    Authors: Ryan Prescott Adams, Jasper Snoek, Hugo Larochelle
    Subjects: Machine Learning
    Abstract

    Unsupervised discovery of latent representations, in addition to being useful
    for density modeling, visualisation and exploratory data analysis, is also
    increasingly important for learning features relevant to discriminative tasks.
    Autoencoders, in particular, have proven to be an effective way to learn latent
    codes that reflect meaningful variations in data. A continuing challenge,
    however, is guiding an autoencoder toward representations that are useful for
    particular discriminative tasks.

  162. Large Scale Correlation Screening.

    Authors: Alfred O. Hero, Bala Rajaratnam
    Subjects: Machine Learning
    Abstract

    This paper treats the problem of screening for variables with high
    correlations in high dimensional data in which there can be many fewer samples
    than variables. We focus on threshold-based correlation screening methods for
    three related applica- tions: screening for variables with large correlations
    within a single treatment (auto- correlation screening); screening for
    variables with large cross-correlations over two treatments (cross-correlation
    screening); screening for variables that have persistently large
    auto-correlations over two treatments (persistent-correlation screening).

  163. A convex model for non-negative matrix factorization and dimensionality reduction on physical space.

    Authors: Guillermo Sapiro, Ernie Esser, Michael M&#xf6;ller, Stanley Osher, Jack Xin
    Subjects: Machine Learning
    Abstract

    A collaborative convex framework for factoring a data matrix $X$ into a
    non-negative product $AS$, with a sparse coefficient matrix $S$, is proposed.
    We restrict the columns of the dictionary matrix $A$ to coincide with certain
    columns of the data matrix $X$, thereby guaranteeing a physically meaningful
    dictionary and dimensionality reduction. We use $l_{1,\infty}$ regularization
    to select the dictionary from the data and show this leads to an exact convex
    relaxation of $l_0$ in the case of distinct noise free data.

  164. Dependency detection with similarity constraints.

    Authors: Samuel Kaski, Leo Lahti, Samuel Myllykangas, Sakari Knuutila
    Subjects: Machine Learning
    Abstract

    Unsupervised two-view learning, or detection of dependencies between two
    paired data sets, is typically done by some variant of canonical correlation
    analysis (CCA). CCA searches for a linear projection for each view, such that
    the correlations between the projections are maximized. The solution is
    invariant to any linear transformation of either or both of the views; for
    tasks with small sample size such flexibility implies overfitting, which is
    even worse for more flexible nonparametric or kernel-based dependency discovery
    methods.

  165. An Analysis of the Convergence of Graph Laplacians.

    Authors: Ling Huang, Daniel Ting, Michael Jordan
    Subjects: Machine Learning
    Abstract

    Existing approaches to analyzing the asymptotics of graph Laplacians
    typically assume a well-behaved kernel function with smoothness assumptions. We
    remove the smoothness assumption and generalize the analysis of graph
    Laplacians to include previously unstudied graphs including kNN graphs. We also
    introduce a kernel-free framework to analyze graph constructions with shrinking
    neighborhoods in general and apply it to analyze locally linear embedding
    (LLE).

  166. Bayesian Network Structure Learning with Permutation Tests.

    Authors: Marco Scutari, Adriana Brogini
    Subjects: Machine Learning
    Abstract

    In literature there are several studies on the performance of Bayesian
    network structure learning algorithms. The focus of these studies is almost
    always the heuristics learning algorithms are based on, i.e. the maximization
    algorithms used in score-based algorithms or the techniques for learning the
    dependencies of each variable in constraint-based algorithms.

  167. Reproducing Kernel Banach Spaces with the l1 Norm II: Error Analysis for Regularized Least Square Regression.

    Authors: Guohui Song, Haizhang Zhang
    Subjects: Machine Learning
    Abstract

    A typical approach in estimating the learning rate of a regularized learning
    scheme is to bound the approximation error by the sum of the sampling error,
    the hypothesis error and the regularization error. Using a reproducing kernel
    space that satisfies the linear representer theorem brings the advantage of
    discarding the hypothesis error from the sum automatically. Following this
    direction, we illustrate how reproducing kernel Banach spaces with the l1 norm
    can be applied to improve the learning rate estimate of l1-regularization in
    machine learning.

  168. Reproducing Kernel Banach Spaces with the l1 Norm.

    Authors: Guohui Song, Haizhang Zhang, Fred J. Hickernell
    Subjects: Machine Learning
    Abstract

    Targeting at sparse learning, we construct Banach spaces B of functions on an
    input space X with the properties that (1) B possesses an l1 norm in the sense
    that it is isometrically isomorphic to the Banach space of integrable functions
    on X with respect to the counting measure; (2) point evaluations are continuous
    linear functionals on B and are representable through a bilinear form with a
    kernel function; (3) regularized learning schemes on B satisfy the linear
    representer theorem. Examples of kernel functions admissible for the
    construction of such spaces are given.

  169. Convergence Rates of Efficient Global Optimization Algorithms.

    Authors: Adam D. Bull
    Subjects: Machine Learning
    Abstract

    Efficient global optimization is the problem of minimizing an unknown
    function f, using as few evaluations f(x) as possible. It can be considered as
    a continuum-armed bandit problem, with noiseless data and simple regret.
    Expected improvement is perhaps the most popular method for solving this
    problem; the algorithm performs well in experiments, but little is known about
    its theoretical properties. Implementing expected improvement requires a choice
    of Gaussian process prior, which determines an associated space of functions,
    its reproducing-kernel Hilbert space (RKHS).

  170. Reading Dependencies from Covariance Graphs.

    Authors: Jose M. Pe&#xf1;a
    Subjects: Machine Learning
    Abstract

    The covariance graph (aka bi-directed graph) of a probability distribution
    $p$ is the undirected graph $G$ where two nodes are adjacent iff their
    corresponding random variables are marginally dependent in $p$. In this paper,
    we present a graphical criterion for reading dependencies from $G$, under the
    assumption that $p$ satisfies the graphoid properties as well as weak
    transitivity and composition. We prove that the graphical criterion is sound
    and complete in certain sense. We argue that our assumptions are not too
    restrictive.

  171. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model.

    Authors: Shohei Shimizu, Patrik O. Hoyer, Yoshinobu Kawahara, Kenneth Bollen, Takashi Washio, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvarinen
    Subjects: Machine Learning
    Abstract

    Structural equation models and Bayesian networks have been widely used to
    analyze causal relations between continuous variables. In such frameworks,
    linear acyclic models are typically used to model the data-generating process
    of variables. Recently, it was shown that use of non-Gaussianity identifies the
    full structure of a linear acyclic model, i.e., a causal ordering of variables
    and their connection strengths, without using any prior knowledge on the
    network structure, which is not the case with conventional methods.

  172. A Correction of "Deriving a Minimal I-Map of a Belief Network Relative to a Target Ordering of its Nodes".

    Authors: Jose M. Pe&#xf1;a
    Subjects: Machine Learning
    Abstract

    Matzkevich and Abramson (1993b) develop two algorithms, called Methods A and
    B, for efficiently deriving the minimal directed independence map of a directed
    and acyclic graph relative to a given node ordering. Methods A and B are
    claimed to be correct although no proof is provided (a proof is just sketched).
    In this paper, we show that Methods A and B are not correct and propose a
    correction of them.

  173. Sparsity regret bounds for individual sequences in online linear regression.

    Authors: S&#xe9;bastien Gerchinovitz
    Subjects: Machine Learning
    Abstract

    We consider the problem of online linear regression on arbitrary
    deterministic sequences when the ambient dimension $d$ can be much larger than
    the number of time rounds $T$. In this framework we prove deterministic online
    counterparts of the so-called sparsity oracle inequalities introduced in the
    stochastic setting in the past decade. They indicate that the task consisting
    in predicting almost as well as an unknown high-dimensional target vector is
    still statistically feasible if this target vector has only few non-zero
    coordinates.

  174. Ordinal Discriminant Analysis: A new approach to the construction of optimal linear scores for ordinal risk categories.

    Authors: Yizhar Toren
    Subjects: Machine Learning
    Abstract

    Most classification methods provide either a prediction of group membership
    or an assessment of class membership probability. In the case of two-group
    classification the predicted probability can be described as "risk" of
    belonging to a special group.

  175. Estimating Networks With Jumps.

    Authors: Eric P. Xing, Mladen Kolar
    Subjects: Machine Learning
    Abstract

    We study the problem of estimating a temporally varying coefficient and
    varying structure (VCVS) graphical model underlying nonstationary time series
    data, such as social states of interacting individuals or microarray expression
    profiles of gene networks, as opposed to i.i.d. data from an invariant model
    widely considered in current literature of structural estimation. In
    particular, we consider the scenario in which the model evolves in a piece-wise
    constant fashion.

  176. lp-Recovery of the Most Significant Subspace among Multiple Subspaces with Outliers.

    Authors: Teng Zhang, Gilad Lerman
    Subjects: Machine Learning
    Abstract

    We assume data sampled from a mixture of d-dimensional linear subspaces with
    outliers distributed symmetrically around the origin. We study the recovery of
    the global l0 subspace (i.e., with largest number of points) by minimizing the
    lp-averaged distances of data points from d-dimensional subspaces of R^D, where
    p>0. Unlike other lp minimization problems, this minimization is non-convex for
    all p>0 and thus requires different methods for its analysis. We show that if
    0<p<=1, then the global l0 subspace can be recovered by lp minimization with
    overwhelming probability.

  177. Fast Convergent Algorithms for Expectation Propagation Approximate Bayesian Inference.

    Authors: Hannes Nickisch, Matthias W. Seeger
    Subjects: Machine Learning
    Abstract

    We propose a novel algorithm to solve the expectation propagation relaxation
    of Bayesian inference for continuous-variable graphical models. In contrast to
    most previous algorithms, our method is provably convergent. By marrying
    convergent EP ideas from (Opper&Winther 05) with covariance decoupling
    techniques (Wipf&Nagarajan 08, Nickisch&Seeger 09), it runs at least an order
    of magnitude faster than the most commonly used EP solver.

  178. Adaptive Parallel Tempering for Stochastic Maximum Likelihood Learning of RBMs.

    Authors: Guillaume Desjardins, Aaron Courville, Yoshua Bengio
    Subjects: Machine Learning
    Abstract

    Restricted Boltzmann Machines (RBM) have attracted a lot of attention of
    late, as one the principle building blocks of deep networks. Training RBMs
    remains problematic however, because of the intractibility of their partition
    function. The maximum likelihood gradient requires a very robust sampler which
    can accurately sample from the model despite the loss of ergodicity often
    incurred during learning.

  179. Cross-Domain Object Matching with Model Selection.

    Authors: Masashi Sugiyama, Makoto Yamada
    Subjects: Machine Learning
    Abstract

    The goal of cross-domain object matching (CDOM) is to find correspondence
    between two sets of objects in different domains in an unsupervised way. Photo
    album summarization is a typical application of CDOM, where photos are
    automatically aligned into a designed frame expressed in the Cartesian
    coordinate system. CDOM is usually formulated as finding a mapping from objects
    in one domain (photos) to objects in the other domain (frame) so that the
    pairwise dependency is maximized.

  180. Sparse Inverse Covariance Estimation via the Split Bregman Method.

    Authors: Gui-Bo Ye, Xiaohui Xie, Jian-Feng Cai
    Subjects: Machine Learning
    Abstract

    We consider the problem of learning the structure of graphical models by
    estimating the inverse covariance matrix with sparsity regularization. We
    develop a new method based on split Bregman to solve the sparse inverse
    covariance estimation problem. We show that our method is significantly faster
    than the widely used glasso method, which is based on blockwise coordinate
    descent. In addition, our method is easy to implement and can be applied to a
    general class of regularization terms.

  181. A ROAD to Classification in High Dimensional Space.

    Authors: Jianqing Fan, Yang Feng, Xin Tong
    Subjects: Machine Learning
    Abstract

    For high-dimensional classification, it is well known that naively performing
    the Fisher discriminant rule leads to poor results due to diverging spectra and
    noise accumulation. Therefore, researchers proposed independence rules to
    circumvent the diverse spectra, and sparse independence rules to mitigate the
    issue of noise accumulation. However, in biological applications, there are
    often a group of correlated genes responsible for clinical outcomes, and the
    use of the covariance information can significantly reduce misclassification
    rates.

  182. In All Likelihood, Deep Belief Is Not Enough.

    Authors: Lucas Theis, Sebastian Gerwinn, Fabian Sinz, Matthias Bethge
    Subjects: Machine Learning
    Abstract

    Statistical models of natural stimuli provide an important tool for
    researchers in the fields of machine learning and computational neuroscience. A
    canonical way to quantitatively assess and compare the performance of
    statistical models is given by the likelihood. One class of statistical models
    which has recently gained increasing popularity and has been applied to a
    variety of complex data are deep belief networks.

  183. Graphical Models as Block-Tree Graphs.

    Authors: Jose M. F. Moura, Divyanshu Vats
    Subjects: Machine Learning
    Abstract

    We introduce block-tree graphs as a framework for deriving efficient
    algorithms on graphical models. We define block-tree graphs as a
    tree-structured graph where each node is a cluster of nodes such that the
    clusters in the graph are disjoint. This differs from junction-trees, where two
    clusters connected by an edge always have at least one common node.

  184. Learning Planar Ising Models.

    Authors: Michael Chertkov, Jason K. Johnson, Praneeth Netrapalli
    Subjects: Machine Learning
    Abstract

    Inference and learning of graphical models are both well-studied problems in
    statistics and machine learning that have found many applications in science
    and engineering. However, exact inference is intractable in general graphical
    models, which suggests the problem of seeking the best approximation to a
    collection of random variables within some tractable family of graphical
    models. In this paper, we focus our attention on the class of planar Ising
    models, for which inference is tractable using techniques of statistical
    physics [Kac and Ward; Kasteleyn].

  185. Regularization Strategies and Empirical Bayesian Learning for MKL.

    Authors: Taiji Suzuki, Ryota Tomioka
    Subjects: Machine Learning
    Abstract

    Multiple kernel learning (MKL) has received considerable attention recently.
    In this paper, we show how different MKL algorithms can be understood as
    applications of different types of regularization on the kernel weights. Within
    the regularization view we consider in this paper, the
    Tikhonov-regularization-based formulation of MKL allows us to consider a
    generative probabilistic model behind MKL. Based on this model, we propose
    learning algorithms for the kernel weights through the maximization of
    marginalized likelihood.

  186. Online Learning: Beyond Regret.

    Authors: Ambuj Tewari, Karthik Sridharan, Alexander Rakhlin
    Subjects: Machine Learning
    Abstract

    We study online learnability of a wide class of problems, extending the
    results of (Rakhlin, Sridharan, Tewari, 2010) to general notions of performance
    measure well beyond external regret. Our framework simultaneously captures such
    well-known notions as internal and general Phi-regret, learning with
    non-additive global cost functions, Blackwell's approachability, calibration of
    forecasters, adaptive regret, and more.

  187. Stability of Density-Based Clustering.

    Authors: Larry Wasserman, Aarti Singh, Alessandro Rinaldo, Rebecca Nugent
    Subjects: Machine Learning
    Abstract

    High density clusters can be characterized by the connected components of a
    level set $L(\lambda) = \{x:\ p(x)>\lambda\}$ of the underlying probability
    density function $p$ generating the data, at some appropriate level
    $\lambda\geq 0$. The complete hierarchical clustering can be characterized by a
    cluster tree ${\cal T}= \bigcup_{\lambda} L(\lambda)$. In this paper, we study
    the behavior of a density level set estimate $\widehat L(\lambda)$ and cluster
    tree estimate $\widehat{\cal{T}}$ based on a kernel density estimator with
    kernel bandwidth $h$.

  188. Robust Matrix Decomposition with Outliers.

    Authors: Tong Zhang, Sham M. Kakade, Daniel Hsu
    Subjects: Machine Learning
    Abstract

    Suppose a given observation matrix can be decomposed as the sum of a low-rank
    matrix and a sparse matrix (outliers), and the goal is to recover these
    individual components from the observed sum. Such additive decompositions have
    applications in a variety of numerical problems including system
    identification, latent variable graphical modeling, and principal components
    analysis. We study conditions under which recovering such a decomposition is
    possible via a combination of $\ell_1$ norm and trace norm minimization.

  189. The Lasso under Heteroscedasticity.

    Authors: Bin Yu, Karl Rohe, Jinzhu Jia
    Subjects: Machine Learning
    Abstract

    The performance of the Lasso is well understood under the assumptions of the
    standard linear model with homoscedastic noise. However, in several
    applications, the standard model does not describe the important features of
    the data. This paper examines how the Lasso performs on a non-standard model
    that is motivated by medical imaging applications. In these applications, the
    variance of the noise scales linearly with the expectation of the observation.
    Like all heteroscedastic models, the noise terms in this Poisson-like model are
    \textit{not} independent of the design matrix.

  190. Community Detection in Networks: The Leader-Follower Algorithm.

    Authors: Devavrat Shah, Tauhid Zaman
    Subjects: Machine Learning
    Abstract

    Traditional spectral clustering methods cannot naturally learn the number of
    communities in a network and often fail to detect smaller community structure
    in dense networks because they are based upon external community connectivity
    properties such as graph cuts. We propose an algorithm for detecting community
    structure in networks called the leader-follower algorithm which is based upon
    the natural internal structure expected of communities in social networks.

  191. From Sparse Signals to Sparse Residuals for Robust Sensing.

    Authors: Georgios B. Giannakis, Vassilis Kekatos
    Subjects: Machine Learning
    Abstract

    One of the key challenges in sensor networks is the extraction of information
    by fusing data from a multitude of distinct, but possibly unreliable sensors.
    Recovering information from the maximum number of dependable sensors while
    specifying the unreliable ones is critical for robust sensing. This sensing
    task is formulated here as that of finding the maximum number of feasible
    subsystems of linear equations, and proved to be NP-hard. Useful links are
    established with compressive sampling, which aims at recovering vectors that
    are sparse.

  192. Concentration inequalities of the cross-validation estimator for Empirical Risk Minimiser.

    Authors: Matthieu Cornec
    Subjects: Machine Learning
    Abstract

    In this article, we derive concentration inequalities for the
    cross-validation estimate of the generalization error for empirical risk
    minimizers. In the general setting, we prove sanity-check bounds in the spirit
    of \cite{KR99} \textquotedblleft\textit{bounds showing that the worst-case
    error of this estimate is not much worse that of training error estimate}
    \textquotedblright . General loss functions and class of predictors with finite
    VC-dimension are considered.

  193. f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models.

    Authors: Taiji Suzuki, Masashi Sugiyama, Takafumi Kanamori
    Subjects: Machine Learning
    Abstract

    A density ratio is defined by the ratio of two probability densities. We
    study the inference problem of density ratios and apply a semi-parametric
    density-ratio estimator to the two-sample homogeneity test. In the proposed
    test procedure, the f-divergence between two probability densities is estimated
    using a density-ratio estimator. The f-divergence estimator is then exploited
    for the two-sample homogeneity test.

  194. Estimating time-varying networks.

    Authors: Amr Ahmed, Eric P. Xing, Mladen Kolar, Le Song
    Subjects: Machine Learning
    Abstract

    Stochastic networks are a plausible representation of the relational
    information among entities in dynamic systems such as living cells or social
    communities. While there is a rich literature in estimating a static or
    temporally invariant network from observation data, little has been done toward
    estimating time-varying networks from time series of entity attributes.

  195. Maximum Likelihood Joint Tracking and Association in a Strong Clutter without Combinatorial Complexity.

    Authors: Leonid I. Perlovsky, Ross W. Deming
    Subjects: Machine Learning
    Abstract

    We have developed an efficient algorithm for the maximum likelihood joint
    tracking and association problem in a strong clutter for GMTI data. By using an
    iterative procedure of the dynamic logic process "from vague-to-crisp," the new
    tracker overcomes combinatorial complexity of tracking in highly-cluttered
    scenarios and results in a significant improvement in signal-to-clutter ratio.

  196. A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures.

    Authors: Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky, Lang Tong
    Subjects: Machine Learning
    Abstract

    The problem of maximum-likelihood (ML) estimation of discrete tree-structured
    distributions is considered. Chow and Liu established that ML-estimation
    reduces to the construction of a maximum-weight spanning tree using the
    empirical mutual information quantities as the edge weights. Using the theory
    of large-deviations, we analyze the exponent associated with the error
    probability of the event that the ML-estimate of the Markov tree structure
    differs from the true tree structure, given a set of independently drawn
    samples.

  197. Exact block-wise optimization in group lasso for linear regression.

    Authors: Mathias Drton, Rina Foygel
    Subjects: Machine Learning
    Abstract

    The group lasso is a penalized regression method, used in regression problems
    where the covariates are partitioned into groups to promote sparsity at the
    group level. Existing methods for finding the group lasso estimator either use
    gradient projection methods to update the entire coefficient vector
    simultaneously at each step, or update one group of coefficients at a time
    using an inexact line search to approximate the optimal value for the group of
    coefficients when all other groups' coefficients are fixed.

  198. Online Multiple Kernel Learning for Structured Prediction.

    Authors: Mario A. T. Figueiredo, Eric P. Xing, Andre F.T. Martins, Pedro M. Q. Aguiar, Noah A. Smith
    Subjects: Machine Learning
    Abstract

    Despite the recent progress towards efficient multiple kernel learning (MKL),
    the structured output case remains an open research front. Current approaches
    involve repeatedly solving a batch learning problem, which makes them
    inadequate for large scale scenarios. We propose a new family of online
    proximal algorithms for MKL (as well as for group-lasso and variants thereof),
    which overcomes that drawback. We show regret, convergence, and generalization
    bounds for the proposed method.

  199. Infinite Hierarchical MMSB Model for Nested Communities/Groups in Social Networks.

    Authors: Eric P. Xing, Le Song, Qirong Ho, Ankur P. Parikh
    Subjects: Machine Learning
    Abstract

    Actors in realistic social networks play not one but a number of diverse
    roles depending on whom they interact with, and a large number of such
    role-specific interactions collectively determine social communities and their
    organizations. Methods for analyzing social networks should capture these
    multi-faceted role-specific interactions, and, more interestingly, discover the
    latent organization or hierarchy of social communities.

  200. On the extension of trace norm to tensors.

    Authors: Ryota Tomioka, Kohei Hayashi, Hisashi Kashima
    Subjects: Machine Learning
    Abstract

    In this paper, we propose three extensions of trace norm for the minimization
    of tensor rank via convex optimization. The alternating direction method of
    multipliers is used to efficiently solve the optimization problems. One of the
    proposed extensions recovers partially observed tensor almost perfectly from a
    small fraction of observations.

  201. A bagging SVM to learn from positive and unlabeled examples.

    Authors: Jean-Philippe Vert, Fantine Mordelet
    Subjects: Machine Learning
    Abstract

    We consider the problem of learning a binary classifier from a training set
    of positive and unlabeled examples, both in the inductive and in the
    transductive setting. This problem, often referred to as \emph{PU learning},
    differs from the standard supervised classification problem by the lack of
    negative examples in the training set.

  202. Local Optimality of User Choices and Collaborative Competitive Filtering.

    Authors: Shuang Hong Yang
    Subjects: Machine Learning
    Abstract

    We describe a novel framework for learning recommender models for
    recommendation systems, which views user-system-item interactions as an
    opportunity give-and-take process, and encodes both "collaboration" and
    "competition" mechanisms underlying the interaction. The proposed framework
    leverages the latent factor models of collaborative filtering to encode
    "collaboration" (via factor sharing); and in the meanwhile, it utilizes a type
    of objectives that implies local optimality of user choices to encode
    "competition".

  203. Regularizers for Structured Sparsity.

    Authors: Massimiliano Pontil, Charles A. Micchelli, Jean M. Morales
    Subjects: Machine Learning
    Abstract

    We study the problem of learning a sparse linear regression vector under
    additional conditions on the structure of its sparsity pattern. This problem is
    relevant in machine learning, statistics and signal processing. It is well
    known that a linear regression can benefit from the knowledge that the
    underlying regression vector is sparse. The combinatorial problem of selecting
    the nonzero components of this vector can be ``relaxed'' by regularizing the
    squared error with a convex penalty function like the $\ell_1$ norm.

  204. Asymptotic Normality of Support Vector Machines for Classi?cation and Regression.

    Authors: Robert Hable
    Subjects: Machine Learning
    Abstract

    In nonparametric classification and regression problems, support vector
    machines (SVMs) attract much attention in theoretical and in applied
    statistics. In an abstract sense, SVMs can be seen as regularized M-estimators
    for a parameter in a (typically infinite dimensional) reproducing kernel
    Hilbert space. In the article, it is shown that the difference between the
    empirical SVM and the theoretical SVM is asymptotically normal with rate
    $\sqrt{n}$. That is, the standardized difference converges weakly to a Gaussian
    process in the reproducing kernel Hilbert space.

  205. Kernel Bayes' rule.

    Authors: Kenji Fukumizu, Arthur Gretton, Le Song
    Subjects: Machine Learning
    Abstract

    A kernel method is proposed for realizing Bayes' rule, based on
    representations of probability distributions in reproducing kernel Hilbert
    spaces (RKHS). The empirical RKHS embeddings of the conditional probabilities
    and prior are expressed as feature mappings of samples, and an RKHS embedding
    of the posterior distribution is computed, again based on a feature mapping of
    a sample. This kernel Bayes' rule can be applied to a wide variety of
    nonparametric Bayesian inference problems. As an example, the approach is used
    in filtering with a nonparametric state-space model.

  206. Task-Driven Dictionary Learning.

    Authors: Francis Bach, Julien Mairal, Jean Ponce
    Subjects: Machine Learning
    Abstract

    Modeling data with linear combinations of a few elements from a learned
    dictionary has been the focus of much recent research in machine learning,
    neuroscience and signal processing. For signals such as natural images that
    admit such sparse representations, it is now well established that these models
    are well suited to restoration tasks. In this context, learning the dictionary
    amounts to solving a large-scale matrix factorization problem, which can be
    done efficiently with classical optimization tools.

  207. Learning Latent Tree Graphical Models.

    Authors: Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky, Myung Jin Choi
    Subjects: Machine Learning
    Abstract

    We study the problem of learning a latent tree graphical model where samples
    are available only from a subset of variables. We propose two consistent and
    computationally efficient algorithms for learning minimal latent trees, that
    is, trees without any redundant hidden nodes. Unlike many existing methods, the
    observed nodes (or variables) are not constrained to be leaf nodes. Our first
    algorithm, recursive grouping, builds the latent tree recursively by
    identifying sibling groups using so-called information distances.

  208. Calibrated Surrogate Losses for Classification with Label-Dependent Costs.

    Authors: Clayton Scott
    Subjects: Machine Learning
    Abstract

    We present surrogate regret bounds for arbitrary surrogate losses in the
    context of binary classification with label-dependent costs. Such bounds relate
    a classifier's risk, assessed with respect to a surrogate loss, to its
    cost-sensitive classification risk. Two approaches to surrogate regret bounds
    are developed. The first is a direct generalization of Bartlett et al.

  209. Efficient Bayesian Community Detection using Non-negative Matrix Factorisation.

    Authors: Ioannis Psorakis, Stephen Roberts, Ben Sheldon
    Subjects: Machine Learning
    Abstract

    Identifying overlapping communities in networks is a challenging task. In
    this work we present a novel approach to community detection that utilises the
    Bayesian non-negative matrix factorisation (NMF) model to produce a
    probabilistic output for node memberships. The scheme has the advantage of
    computational efficiency, soft community membership and an intuitive
    foundation. We present the performance of the method against a variety of
    benchmark problems and compare and contrast it to several other algorithms for
    community detection.

  210. Proximal Methods for Hierarchical Sparse Coding.

    Authors: Francis Bach, Rodolphe Jenatton, Guillaume Obozinski, Julien Mairal
    Subjects: Machine Learning
    Abstract

    Sparse coding consists in representing signals as sparse linear combinations
    of atoms selected from a dictionary. We consider an extension of this framework
    where the atoms are further assumed to be embedded in a tree. This is achieved
    using a recently introduced tree-structured sparse regularization norm, which
    has proven useful in several applications. This norm leads to regularized
    problems that are difficult to optimize, and we propose in this paper efficient
    algorithms for solving them.

  211. Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data.

    Authors: Tran The Truyen, Dinh Q. Phung, Svetha Venkatesh, Hung H. Bui
    Subjects: Machine Learning
    Abstract

    Inspired by the hierarchical hidden Markov models (HHMM), we present the
    hierarchical semi-Markov conditional random field (HSCRF), a generalisation of
    embedded undirectedMarkov chains tomodel complex hierarchical, nestedMarkov
    processes. It is parameterised in a discriminative framework and has polynomial
    time algorithms for learning and inference. Importantly, we consider
    partiallysupervised learning and propose algorithms for generalised
    partially-supervised learning and constrained inference.

  212. Optimizing an Organized Modularity Measure for Topographic Graph Clustering: a Deterministic Annealing Approach.

    Authors: Fabrice Rossi, Nathalie Villa-Vialaneix
    Subjects: Machine Learning
    Abstract

    This paper proposes an organized generalization of Newman and Girvan's
    modularity measure for graph clustering. Optimized via a deterministic
    annealing scheme, this measure produces topologically ordered graph clusterings
    that lead to faithful and readable graph representations based on clustering
    induced graphs. Topographic graph clustering provides an alternative to more
    classical solutions in which a standard graph clustering method is applied to
    build a simpler graph that is then represented with a graph layout algorithm.

  213. On the Estimation of Coherence.

    Authors: Ameet Talwalkar, Mehryar Mohri
    Subjects: Machine Learning
    Abstract

    Low-rank matrix approximations are often used to help scale standard machine
    learning algorithms to large-scale problems. Recently, matrix coherence has
    been used to characterize the ability to extract global information from a
    subset of matrix entries in the context of these low-rank approximations and
    other sampling-based algorithms, e.g., matrix com- pletion, robust PCA.

  214. Information-theoretic lower bounds on the oracle complexity of convex optimization.

    Authors: Martin J. Wainwright, Pradeep Ravikumar, Peter L. Bartlett, Alekh Agarwal
    Subjects: Machine Learning
    Abstract

    Relative to the large literature on upper bounds on complexity of convex
    optimization, lesser attention has been paid to the fundamental hardness of
    these problems. Given the extensive use of convex optimization in machine
    learning and statistics, gaining an understanding of these complexity-theoretic
    issues is important. In this paper, we study the complexity of stochastic
    convex optimization in an oracle model of computation. We improve upon known
    results and obtain tight minimax complexity estimates for various function
    classes.

  215. High-dimensional covariance estimation based on Gaussian graphical models.

    Authors: Shuheng Zhou, Peter Buhlmann, Philipp Rutimann, Min Xu
    Subjects: Machine Learning
    Abstract

    Undirected graphs are often used to describe high dimensional distributions.
    Under sparsity conditions, the graph can be estimated using
    {\ell}1-penalization methods. We propose and study the following method. We
    combine a multiple regression approach with ideas of thresholding and
    refitting: first we infer a sparse undirected graphical model structure via
    thresholding of each among many {\ell}1-norm penalized regression functions; we
    then estimate the covariance matrix and its inverse using the maximum
    likelihood estimator.

  216. Mixed Cumulative Distribution Networks.

    Authors: Ricardo Silva, Charles Blundell, Yee Whye Teh
    Subjects: Machine Learning
    Abstract

    Directed acyclic graphs (DAGs) are a popular framework to express
    multivariate probability distributions. Acyclic directed mixed graphs (ADMGs)
    are generalizations of DAGs that can succinctly capture much richer sets of
    conditional independencies, and are especially useful in modeling the effects
    of latent variables implicitly. Unfortunately there are currently no good
    parameterizations of general ADMGs. In this paper, we apply recent work on
    cumulative distribution networks and copulas to propose one one general
    construction for ADMG models.

  217. Union Support Recovery in Multi-task Learning.

    Authors: Larry Wasserman, John Lafferty, Mladen Kolar
    Subjects: Machine Learning
    Abstract

    We sharply characterize the performance of different penalization schemes for
    the problem of selecting the relevant variables in the multi-task setting.
    Previous work focuses on the regression problem where conditions on the design
    matrix complicate the analysis. A clearer and simpler picture emerges by
    studying the Normal means model. This model, often used in the field of
    statistics, is a simplified model that provides a laboratory for studying
    complex procedures.

  218. Brain covariance selection: better individual functional connectivity models using population prior.

    Authors: Ga&#xeb;l Varoquaux, Jean Baptiste Poline, Bertrand Thirion, Alexandre Gramfort
    Subjects: Machine Learning
    Abstract

    Spontaneous brain activity, as observed in functional neuroimaging, has been
    shown to display reproducible structure that expresses brain architecture and
    carries markers of brain pathologies. An important view of modern neuroscience
    is that such large-scale structure of coherent activity reflects modularity
    properties of brain connectivity graphs. However, to date, there has been no
    demonstration that the limited and noisy data available in spontaneous activity
    observations could be used to learn full-brain probabilistic models that
    generalize to new data.

  219. Sparse Group Restricted Boltzmann Machines.

    Authors: Heng Luo, Ruimin Shen, Cahngyong Niu
    Subjects: Machine Learning
    Abstract

    Since learning is typically very slow in Boltzmann machines, there is a need
    to restrict connections within hidden layers. However, the resulting states of
    hidden units exhibit statistical dependencies. Based on this observation, we
    propose using $l_1/l_2$ regularization upon the activation possibilities of
    hidden units in restricted Boltzmann machines to capture the loacal
    dependencies among hidden units.

  220. Entropy-Based Search Algorithm for Experimental Design.

    Authors: N. K. Malakar, K. H. Knuth
    Subjects: Machine Learning
    Abstract

    The scientific method relies on the iterated processes of inference and
    inquiry. The inference phase consists of selecting the most probable models
    based on the available data; whereas the inquiry phase consists of using what
    is known about the models to select the most relevant experiment. Optimizing
    inquiry involves searching the parameterized space of experiments to select the
    experiment that promises, on average, to be maximally informative.

  221. Kernel induced random survival forests.

    Authors: Jiheng Wang, Guangzhe Fan, Fang Yang
    Subjects: Machine Learning
    Abstract

    Kernel Induced Random Survival Forests (KIRSF) is a statistical learning
    algorithm which aims to improve prediction accuracy for survival data. As in
    Random Survival Forests (RSF), Cumulative Hazard Function is predicted for each
    individual in the test set. Prediction error is estimated using Harrell's
    concordance index (C index) [Harrell et al. (1982)]. The C-index can be
    interpreted as a misclassification probability and does not depend on a single
    fixed time for evaluation. The C-index also specifically accounts for
    censoring.

  222. A Simple Kernel-based Nearest Neighbor Method for the MNIST Database of Handwritten Digits Classification.

    Authors: Jiheng Wang, Zhou Wang, Guangzhe Fan
    Subjects: Machine Learning
    Abstract

    In this paper we propose a series of simple and fast kernel-based nearest
    neighbor classification algorithms based on CW-SSIM index for the MNIST
    database, which appears to be effective and reliable tools for the MNIST
    Database of Handwritten Digits Classification. Given that the CW-SSIM index
    provides a powerful similarity measure between two misaligned images and there
    are sufficient training examples in the MNIST database, we obtain amazing
    results with employing the simplest k-NN model, i.e., only using most similar
    images to classify test examples.

  223. A unifying view for performance measures in multi-class prediction.

    Authors: Giuseppe Jurman, Cesare Furlanello
    Subjects: Machine Learning
    Abstract

    In the last few years, many different performance measures have been
    introduced to overcome the weakness of the most natural metric, the Accuracy.
    Among them, Matthews Correlation Coefficient has recently gained popularity
    among researchers not only in machine learning but also in several application
    fields such as bioinformatics. Nonetheless, further novel functions are being
    proposed in literature.

  224. PMOG: The projected mixture of Gaussians model with application to blind source separation.

    Authors: Gautam V. Pendse
    Subjects: Machine Learning
    Abstract

    We extend the mixtures of Gaussians (MOG) model to the projected mixture of
    Gaussians (PMOG) model. In the PMOG model, we assume that q dimensional input
    data points z_i are projected by a q dimensional vector w into 1-D variables
    u_i. The projected variables u_i are assumed to follow a 1-D MOG model. In the
    PMOG model, we maximize the likelihood of observing u_i to find both the model
    parameters for the 1-D MOG as well as the projection vector w. First, we derive
    an EM algorithm for estimating the PMOG model.

  225. Large Scale Variational Inference and Experimental Design for Sparse Generalized Linear Models.

    Authors: Hannes Nickisch, Matthias W. Seeger
    Subjects: Machine Learning
    Abstract

    Many problems of low-level computer vision and image processing, such as
    denoising, deconvolution, tomographic reconstruction or super-resolution, can
    be addressed by maximizing the posterior distribution of a sparse linear model
    (SLM). We show how higher-order Bayesian decision-making problems, such as
    optimizing image acquisition in magnetic resonance scanners, can be addressed
    by querying the SLM posterior covariance, unrelated to the density's mode.

  226. Faithfulness in Chain Graphs: The Gaussian Case.

    Authors: Jose M. Pe&#xf1;a
    Subjects: Machine Learning
    Abstract

    This paper deals with chain graphs under the classic
    Lauritzen-Wermuth-Frydenberg interpretation. We prove that the regular Gaussian
    distributions that factorize with respect to a chain graph $G$ with $d$
    parameters have positive Lebesgue measure with respect to $\mathbb{R}^d$,
    whereas those that factorize with respect to $G$ but are not faithful to it
    have zero Lebesgue measure with respect to $\mathbb{R}^d$. This means that, in
    the measure-theoretic sense described, almost all the regular Gaussian
    distributions that factorize with respect to $G$ are faithful to it.

  227. Discovering shared and individual latent structure in multiple time series.

    Authors: Suchi Saria, Daphne Koller, Anna Penn
    Subjects: Machine Learning
    Abstract

    This paper proposes a nonparametric Bayesian method for exploratory data
    analysis and feature construction in continuous time series. Our method focuses
    on understanding shared features in a set of time series that exhibit
    significant individual variability. Our method builds on the framework of
    latent Diricihlet allocation (LDA) and its extension to hierarchical Dirichlet
    processes, which allows us to characterize each series as switching between
    latent ``topics'', where each topic is characterized as a distribution over
    ``words'' that specify the series dynamics.

  228. Algorithmic Detection of Computer Generated Text.

    Authors: Allen Lavoie, Mukkai Krishnamoorthy
    Subjects: Machine Learning
    Abstract

    Computer generated academic papers have been used to expose a lack of
    thorough human review at several computer science conferences. We assess the
    problem of classifying such documents. After identifying and evaluating several
    quantifiable features of academic papers, we apply methods from machine
    learning to build a binary classifier.

  229. Support Vector Machines for Additive Models: Consistency and Robustness.

    Authors: Robert Hable, Andreas Christmann
    Subjects: Machine Learning
    Abstract

    Support vector machines (SVMs) are special kernel based methods and belong to
    the most successful learning methods since more than a decade. SVMs can
    informally be described as a kind of regularized M-estimators for functions and
    have demonstrated their usefulness in many complicated real-life problems.
    During the last years a great part of the statistical research on SVMs has
    concentrated on the question how to design SVMs such that they are universally
    consistent and statistically robust for nonparametric classification or
    nonparametric regression purposes.

  230. A generalized risk-based approach to segmentation based on hidden Markov models.

    Authors: J&#xfc;ri Lember, Alexey A. Koloydenko
    Subjects: Machine Learning
    Abstract

    Motivated by the continuing interest in discrete time hidden Markov models
    (HMMs), this paper reexamines these models using a risk-based approach. Simple
    modifications of the classical optimization criteria for hidden path inference
    lead to a new class of hidden path estimators. The estimators are efficiently
    computed in the usual forward-backward manner and a corresponding dynamic
    programming algorithm is also presented.

  231. Directional Statistics on Permutations.

    Authors: Sergey M. Plis, Terran Lane, Vince D. Calhoun
    Subjects: Machine Learning
    Abstract

    Distributions over permutations arise in applications ranging from
    multi-object tracking to ranking of instances. The difficulty of dealing with
    these distributions is caused by the size of their domain, which is factorial
    in the number of considered entities ($n!$). It makes the direct definition of
    a multinomial distribution over permutation space impractical for all but a
    very small $n$. In this work we propose an embedding of all $n!$ permutations
    for a given $n$ in a surface of a hypersphere defined in
    $\mathbbm{R}^{(n-1)^2}$.

  232. Spectral clustering and the high-dimensional Stochastic Block Model.

    Authors: Sourav Chatterjee, Bin Yu, Karl Rohe
    Subjects: Machine Learning
    Abstract

    Networks or graphs can easily represent a diverse set of data sources that
    are characterized by interacting units or actors. Social networks, representing
    people who communicate with each other, are one example. Communities or
    clusters of highly connected actors form an essential feature in the structure
    of several empirical networks. Spectral clustering is a popular and
    computationally feasible method to discover these communities. The Stochastic
    Block Model (Holland et al., 1983) is a social network model with well defined
    communities; each node is a member of one community.

  233. Euclidean Distances, soft and spectral Clustering on Weighted Graphs.

    Authors: Fran&#xe7;ois Bavaud
    Subjects: Machine Learning
    Abstract

    We define a class of Euclidean distances on weighted graphs, enabling to
    perform thermodynamic soft graph clustering. The class can be constructed form
    the "raw coordinates" encountered in spectral clustering, and can be extended
    by means of higher-dimensional embeddings (Schoenberg transformations).
    Geographical flow data, properly conditioned, illustrate the procedure as well
    as visualization aspects.

  234. Discovering Graphical Granger Causality Using the Truncating Lasso Penalty.

    Authors: Ali Shojaie, George Michailidis
    Subjects: Machine Learning
    Abstract

    Components of biological systems interact with each other in order to carry
    out vital cell functions. Such information can be used to improve estimation
    and inference, and to obtain better insights into the underlying cellular
    mechanisms. Discovering regulatory interactions among genes is therefore an
    important problem in systems biology. Whole-genome expression data over time
    provides an opportunity to determine how the expression levels of genes are
    affected by changes in transcription levels of other genes, and can therefore
    be used to discover regulatory interactions among genes.

  235. Minimax Manifold Estimation.

    Authors: Larry Wasserman, Marco Perone-Pacifico, Isabella Verdinelli, Christopher Genovese
    Subjects: Machine Learning
    Abstract

    We find the minimax rate of convergence in Hausdorff distance for estimating
    a manifold M of dimension d embedded in R^D given a noisy sample from the
    manifold. We assume that the manifold satisfies a smoothness condition and that
    the noise distribution has compact support. We show that the optimal rate of
    convergence is n^{-2/(2+d)}. Thus, the minimax rate depends only on the
    dimension of the manifold, not on the dimension of the space in which M is
    embedded.

  236. Variable Selection and Dimension Reduction via Sparse Gradients.

    Authors: Gui-Bo Ye, Xiaohui Xie
    Subjects: Machine Learning
    Abstract

    Variable selection and dimension reduction are two commonly adopted
    approaches for high-dimensional data analysis, but have traditionally been
    treated separately. Here we propose an integrated approach, called sparse
    gradient learning (SGL), for simultaneous variable selection and dimension
    reduction via learning the gradient of the prediction function directly from
    samples.

  237. Graph-Valued Regression.

    Authors: Larry Wasserman, Han Liu, John Lafferty, Xi Chen
    Subjects: Machine Learning
    Abstract

    Undirected graphical models encode in a graph $G$ the dependency structure of
    a random vector $Y$. In many applications, it is of interest to model $Y$ given
    another random vector $X$ as input. We refer to the problem of estimating the
    graph $G(x)$ of $Y$ conditioned on $X=x$ as ``graph-valued regression.'' In
    this paper, we propose a semiparametric method for estimating $G(x)$ that
    builds a tree on the $X$ space just as in CART (classification and regression
    trees), but at each leaf of the tree estimates a graph. We call the method
    ``Graph-optimized CART,'' or Go-CART.

  238. Heavy-Tailed Processes for Selective Shrinkage.

    Authors: Michael I. Jordan, Fabian L. Wauthier
    Subjects: Machine Learning
    Abstract

    Heavy-tailed distributions are frequently used to enhance the robustness of
    regression and classification methods to outliers in output space. Often,
    however, we are confronted with ``outliers'' in input space, which are isolated
    observations in sparsely populated regions. We show that heavy-tailed
    stochastic processes (which we construct from Gaussian processes via a copula),
    can be used to improve robustness of regression and classification estimators
    to such outliers by selectively shrinking them more strongly in sparse regions
    than in dense regions.

  239. Decisional States.

    Authors: Nicolas Brodu
    Subjects: Machine Learning
    Abstract

    This article introduces the decisional states of system, and provides a
    practical algorithm for computing them. The decisional states are defined as
    the internal states of a system that lead to the same decision, based on a
    user-provided utility or pay-off function. The utility function encodes some a
    priori knowledge external to the system, it quantifies how bad it is to make
    mistakes. The intrinsic underlying structure of the system is modeled by an
    epsilon-machine and its causal states. The decisional states are the emerging
    patterns corresponding to the utility function.

  240. Gaussian Mixture Modeling with Gaussian Process Latent Variable Models.

    Authors: Hannes Nickisch, Carl Edward Rasmussen
    Subjects: Machine Learning
    Abstract

    Density modeling is notoriously difficult for high dimensional data. One
    approach to the problem is to search for a lower dimensional manifold which
    captures the main characteristics of the data. Recently, the Gaussian Process
    Latent Variable Model (GPLVM) has successfully been used to find low
    dimensional manifolds in a variety of complex data. The GPLVM consists of a set
    of points in a low dimensional latent space, and a stochastic map to the
    observed space. We show how it can be interpreted as a density model in the
    observed space.

  241. Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models.

    Authors: Larry Wasserman, Kathryn Roeder, Han Liu
    Subjects: Machine Learning
    Abstract

    A challenging problem in estimating high-dimensional graphical models is to
    choose the regularization parameter in a data-dependent way. The standard
    techniques include $K$-fold cross-validation ($K$-CV), Akaike information
    criterion (AIC), and Bayesian information criterion (BIC). Though these methods
    work well for low-dimensional problems, they are not suitable in high
    dimensional settings. In this paper, we present StARS: a new stability-based
    method for choosing the regularization parameter in high dimensional inference
    for undirected graphs.

  242. Landau Theory of Adaptive Integration in Computational Intelligence.

    Authors: Dariusz Plewczynski
    Subjects: Machine Learning
    Abstract

    Computational Intelligence (CI) is a sub-branch of Artificial Intelligence
    paradigm focusing on the study of adaptive mechanisms to enable or facilitate
    intelligent behavior in complex and changing environments. There are several
    paradigms of CI [like artificial neural networks, evolutionary computations,
    swarm intelligence, artificial immune systems, fuzzy systems and many others],
    each of these has its origins in biological systems [biological neural systems,
    natural Darwinian evolution, social behavior, immune system, interactions of
    organisms with their environment].

  243. C-HiLasso: A Collaborative Hierarchical Sparse Modeling Framework.

    Authors: Guillermo Sapiro, Pablo Sprechmann, Ignacio Ram&#xed;rez, Yonina Eldar
    Subjects: Machine Learning
    Abstract

    Sparse modeling is a powerful framework for data analysis and processing.
    Traditionally, encoding in this framework is performed by solving an
    L1-regularized linear regression problem, commonly referred to as Lasso or
    Basis Pursuit. In this work we combine the sparsity-inducing property of the
    Lasso model at the individual feature level, with the block-sparsity property
    of the Group Lasso model, where sparse groups of features are jointly encoded,
    obtaining a sparsity pattern hierarchically structured.

  244. An Efficient Proximal-Gradient Method for Single and Multi-task Regression with Structured Sparsity.

    Authors: Seyoung Kim, Eric P. Xing, Xi Chen, Qihang Lin, Jaime G. Carbonell, Javier Pe&#xf1;a
    Subjects: Machine Learning
    Abstract

    We consider the optimization problem of learning regression models with a
    mixed-norm penalty that is defined over overlapping groups to achieve
    structured sparsity. It has been previously shown that such penalty can encode
    prior knowledge on the input or output structure to learn an
    structured-sparsity pattern in the regression parameters. However, because of
    the non-separability of the parameters of the overlapping groups, developing an
    efficient optimization method has remained a challenge.

  245. Graph-Structured Multi-task Regression and an Efficient Optimization Method for General Fused Lasso.

    Authors: Seyoung Kim, Eric P. Xing, Xi Chen, Qihang Lin, Jaime G. Carbonell
    Subjects: Machine Learning
    Abstract

    We consider the problem of learning a structured multi-task regression, where
    the output consists of multiple responses that are related by a graph and the
    correlated response variables are dependent on the common inputs in a sparse
    but synergistic manner. Previous methods such as l1/l2-regularized multi-task
    regression assume that all of the output variables are equally related to the
    inputs, although in many real-world problems, outputs are related in a complex
    manner.

  246. Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive, High Dimensional Datasets.

    Authors: Fionn Murtagh, Pedro Contreras
    Subjects: Machine Learning
    Abstract

    Data analysis and data mining are concerned with unsupervised pattern finding
    and structure determination in data sets. "Structure" can be understood as
    symmetry and a range of symmetries are expressed by hierarchy. Such symmetries
    directly point to invariants, that pinpoint intrinsic properties of the data
    and of the background empirical domain of interest. We review many aspects of
    hierarchy here, including ultrametric topology, generalized ultrametric,
    linkages with lattices and other discrete algebraic structures and with p-adic
    number representations.

  247. Context models on sequences of covers.

    Authors: Christos Dimitrakakis
    Subjects: Machine Learning
    Abstract

    We consider estimation of a class of context models that can approximate
    partially observable Markov processes. This class is related to the context
    tree weighting algorithm for discrete sequence prediction (Willems et al.,
    1995). We present a constructive definition of a context process which extends
    the one proposed in (Dimitrakakis, 2010) for the estimation of variable order
    Markov models.

  248. Refinements of Universal Approximation Results for Deep Belief Networks and Restricted Boltzmann Machines.

    Authors: Nihat Ay, Guido Montufar
    Subjects: Machine Learning
    Abstract

    We improve recently published results about resources of Restricted Boltzmann
    Machines (RBM) and Deep Belief Networks (DBN) required to make them Universal
    Approximators. We show that any distribution p on the set of binary vectors of
    length n can be arbitrarily well approximated by an RBM with k-1 hidden units,
    where k is the minimal number of pairs of binary vectors differing in only one
    entry such that their union contains the support set of p. In important cases
    this number is half of the cardinality of the support set of p.

  249. Improving the Johnson-Lindenstrauss Lemma.

    Authors: Javier Rojo, Tuan Nguyen
    Subjects: Machine Learning
    Abstract

    The Johnson-Lindenstrauss Lemma allows for the projection of $n$ points in
    $p-$dimensional Euclidean space onto a $k-$dimensional Euclidean space, with $k
    \ge \frac{24\ln \emph{n}}{3\epsilon^2-2\epsilon^3}$, so that the pairwise
    distances are preserved within a factor of $1\pm\epsilon$. Here, working
    directly with the distributions of the random distances rather than resorting
    to the moment generating function technique, an improvement on the lower bound
    for $k$ is obtained.

  250. Training linear ranking SVMs in linearithmic time using red-black trees.

    Authors: Tapio Pahikkala, Antti Airola, Tapio Salakoski
    Subjects: Machine Learning
    Abstract

    We introduce an efficient method for training the linear ranking support
    vector machine. The method combines cutting plane optimization with red-black
    tree based approach to subgradient calculations, and has O(ms+m*log(m)) time
    complexity, where m is the number of training examples, and s the average
    number of non-zero features per example. Best previously known training
    algorithms achieve the same efficiency only for restricted special cases,
    whereas the proposed approach allows any real valued utility scores in the
    training data.

  251. Introduction to Graphical Modelling.

    Authors: Marco Scutari, Korbinian Strimmer
    Subjects: Machine Learning
    Abstract

    The aim of this chapter is twofold. In the first part we will provide a brief
    overview of the mathematical and statistical foundations of graphical models,
    along with their fundamental properties, estimation and basic inference
    procedures. In particular we will develop Markov networks (also known as Markov
    random fields) and Bayesian networks, which comprise most past and current
    literature on graphical models. In the second part we will review some
    applications of graphical models in systems biology.

  252. A Unifying View of Multiple Kernel Learning.

    Authors: Peter L. Bartlett, Marius Kloft, Ulrich R&#xfc;ckert
    Subjects: Machine Learning
    Abstract

    Recent research on multiple kernel learning has lead to a number of
    approaches for combining kernels in regularized risk minimization. The proposed
    approaches include different formulations of objectives and varying
    regularization strategies. In this paper we present a unifying general
    optimization criterion for multiple kernel learning and show how existing
    formulations are subsumed as special cases.

  253. Large Margin Multiclass Gaussian Classification with Differential Privacy.

    Authors: Manas A. Pathak, Bhiksha Raj
    Subjects: Machine Learning
    Abstract

    As increasing amounts of sensitive personal information finds its way into
    data repositories, it has become important to develop analysis mechanisms that
    can derive aggregate information from these repositories without revealing
    information about individual data instances. The $\epsilon$-differential
    privacy model provides a framework that facilitates development and analysis of
    such mechanisms.

  254. Analysis of a Random Forests Model.

    Authors: G&#xe9;rard Biau
    Subjects: Machine Learning
    Abstract

    Random forests are a scheme proposed by Leo Breiman in the 00's for building
    a predictor ensemble with a set of decision trees that grow in randomly
    selected subspaces of data. Despite growing interest and practical use, there
    has been little exploration of the statistical properties of random forests,
    and little is known about the mathematical forces driving the algorithm. In
    this paper, we offer an in-depth analysis of a random forests model suggested
    by Breiman in 2004, which is very close to the original algorithm.

  255. Sparse Linear Identifiable Multivariate Modeling.

    Authors: Ricardo Henao, Ole Winther
    Subjects: Machine Learning
    Abstract

    In this paper we consider sparse and identifiable linear latent variable
    (factor) and linear Bayesian network models for parsimonious analysis of
    multivariate data. We propose a computationally efficient method for joint
    parameter and model inference, and model comparison. It consists of a fully
    Bayesian hierarchy for sparse models using slab and spike priors (two-component
    delta and continuous mixtures), non-Gaussian latent factors and a stochastic
    search over the ordering of the variables.

  256. Many-to-Many Graph Matching: a Continuous Relaxation Approach.

    Authors: Francis Bach, Jean-Philippe Vert, Mikhail Zaslavskiy
    Subjects: Machine Learning
    Abstract

    Graphs provide an efficient tool for object representation in various
    computer vision applications. Once graph-based representations are constructed,
    an important question is how to compare graphs. This problem is often
    formulated as a graph matching problem where one seeks a mapping between
    vertices of two graphs which optimally aligns their structure. In the classical
    formulation of graph matching, only one-to-one correspondences between vertices
    are considered.

  257. Supervised classification for a family of Gaussian functional models.

    Authors: Amparo Ba&#xed;llo, Juan Antonio Cuesta-Albertos, Antonio Cuevas
    Subjects: Machine Learning
    Abstract

    In the framework of supervised classification (discrimination) for functional
    data, it is shown that the optimal classification rule can be explicitly
    obtained for a class of Gaussian processes with "triangular" covariance
    functions. This explicit knowledge has two practical consequences. First, the
    consistency of the well-known nearest neighbors classifier (which is not
    guaranteed in the problems with functional data) is established for the
    indicated class of processes.

  258. Strong Consistency of Prototype Based Clustering in Probabilistic Space.

    Authors: Vladimir Nikulin, Geoffrey J. McLachlan
    Subjects: Machine Learning
    Abstract

    In this paper we formulate in general terms an approach to prove strong
    consistency of the Empirical Risk Minimisation inductive principle applied to
    the prototype or distance based clustering. This approach was motivated by the
    Divisive Information-Theoretic Feature Clustering model in probabilistic space
    with Kullback-Leibler divergence which may be regarded as a special case within
    the Clustering Minimisation framework.

  259. Spatio-Temporal Graphical Model Selection.

    Authors: Patrick L. Harrington Jr., Alfred O. Hero III
    Subjects: Machine Learning
    Abstract

    We consider the problem of estimating the topology of spatial interactions in
    a discrete state, discrete time spatio-temporal graphical model where the
    interactions affect the temporal evolution of each agent in a network. Among
    other models, the susceptible, infected, recovered ($SIR$) model for
    interaction events fall into this framework. We pose the problem as a structure
    learning problem and solve it using an $\ell_1$-penalized likelihood convex
    program. We evaluate the solution on a simulated spread of infectious over a
    complex network.

  260. Exploratory Analysis of Functional Data via Clustering and Optimal Segmentation.

    Authors: Fabrice Rossi, Georges H&#xe9;brail, Bernard Hugueney, Yves Lechevallier
    Subjects: Machine Learning
    Abstract

    We propose in this paper an exploratory analysis algorithm for functional
    data. The method partitions a set of functions into $K$ clusters and represents
    each cluster by a simple prototype (e.g., piecewise constant). The total number
    of segments in the prototypes, $P$, is chosen by the user and optimally
    distributed among the clusters via two dynamic programming algorithms. The
    practical relevance of the method is shown on two real world datasets.

  261. Visualization of Manifold-Valued Elements by Multidimensional Scaling.

    Authors: Simone Fiori
    Subjects: Machine Learning
    Abstract

    The present contribution suggests the use of a multidimensional scaling (MDS)
    algorithm as a visualization tool for manifold-valued elements. A visualization
    tool of this kind is useful in signal processing and machine learning whenever
    learning/adaptation algorithms insist on high-dimensional parameter manifolds.

  262. On the Schoenberg Transformations in Data Analysis: Theory and Illustrations.

    Authors: Fran&#xe7;ois Bavaud
    Subjects: Machine Learning
    Abstract

    The class of Schoenberg transformations, embedding Euclidean distances into
    higher dimensional Euclidean spaces, is presented, and derived from theorems on
    positive definite and conditionally negative definite matrices. Original
    results on the arc lengths, angles and curvature of the transformations are
    proposed, and visualized on artificial data sets by classical multidimensional
    scaling. A simple distance-based discriminant algorithm illustrates the theory,
    intimately connected to the Gaussian kernels of Machine Learning.

  263. Incorporating Side Information in Probabilistic Matrix Factorization with Gaussian Processes.

    Authors: Ryan Prescott Adams, Iain Murray, George E. Dahl
    Subjects: Machine Learning
    Abstract

    Probabilistic matrix factorization (PMF) is a powerful method for modeling
    data associated with pairwise relationships, finding use in collaborative
    filtering, computational biology, and document analysis, among other areas. In
    many domains, there is additional information that can assist in prediction.
    For example, when modeling movie ratings, we might know when the rating
    occurred, where the user lives, or what actors appear in the movie. It is
    difficult, however, to incorporate this side information into the PMF model.

  264. Resolution and Scale Independent Function Matching Using a String Energy Penalized Spline Prior.

    Authors: David M. Rogers, Thomas L. Beck
    Subjects: Machine Learning
    Abstract

    The extension of the classical Bayesian penalized spline method to inference
    on vector-valued functions is considered, with an emphasis on characterizing
    the suitability of the method for general application.We show that the standard
    quadratic penalty is exactly analogous to the energy of a stretched string,
    with the penalty parameter corresponding to its tension.

  265. Linear Time Feature Selection for Regularized Least-Squares.

    Authors: Tapio Pahikkala, Antti Airola, Tapio Salakoski
    Subjects: Machine Learning
    Abstract

    We propose a novel algorithm for greedy forward feature selection for
    regularized least-squares (RLS) regression and classification, also known as
    the least-squares support vector machine or ridge regression. The algorithm,
    which we call greedy RLS, starts from the empty feature set, and on each
    iteration adds the feature whose addition provides the best leave-one-out
    cross-validation performance.

  266. The Directed Closure Process in Hybrid Social-Information Networks, with an Analysis of Link Formation on Twitter.

    Authors: Jon Kleinberg, Daniel M. Romero
    Subjects: Machine Learning
    Abstract

    It has often been taken as a working assumption that directed links in
    information networks are frequently formed by "short-cutting" a two-step path
    between the source and the destination -- a kind of implicit "link copying"
    analogous to the process of triadic closure in social networks. Despite the
    role of this assumption in theoretical models such as preferential attachment,
    it has received very little direct empirical investigation.

  267. Optimal Allocation Strategies for the Dark Pool Problem.

    Authors: Alekh Agarwal, Peter Bartlett, Max Dama
    Subjects: Machine Learning
    Abstract

    We study the problem of allocating stocks to dark pools. We propose and
    analyze an optimal approach for allocations, if continuous-valued allocations
    are allowed. We also propose a modification for the case when only
    integer-valued allocations are possible. We extend the previous work on this
    problem to adversarial scenarios, while also improving on their results in the
    iid setup. The resulting algorithms are efficient, and perform well in
    simulations under stochastic and adversarial inputs.

  268. Estimation of R\'enyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs.

    Authors: D&#xe1;vid P&#xe1;l, Barnab&#xe1;s P&#xf3;czos, Csaba Szepesv&#xe1;ri
    Subjects: Machine Learning
    Abstract

    In this paper we consider simple and computationally efficient nonparametric
    estimators of R\'enyi entropy and mutual information based on an i.i.d. sample
    drawn from an unknown, absolutely continuous distribution over $\R^d$.
    Following previous works, the estimators are calculated as the sum of $p$-th
    powers of the Euclidean lengths of the edges of the `generalized
    nearest-neighbor' graph of the sample and the empirical copula of the sample
    respectively. Under mild conditions we prove the almost sure consistency of the
    estimators.

  269. Supervised Topic Models.

    Authors: David M. Blei, Jon D. McAuliffe
    Subjects: Machine Learning
    Abstract

    We introduce supervised latent Dirichlet allocation (sLDA), a statistical
    model of labelled documents. The model accommodates a variety of response
    types. We derive an approximate maximum-likelihood procedure for parameter
    estimation, which relies on variational methods to handle intractable posterior
    expectations. Prediction problems motivate this research: we use the fitted
    model to predict response values for new documents. We test sLDA on two
    real-world problems: movie ratings predicted from reviews, and the political
    tone of amendments in the U.S. Senate based on the amendment text.

  270. Universality, Characteristic Kernels and RKHS Embedding of Measures.

    Authors: Kenji Fukumizu, Bharath K. Sriperumbudur, Gert R. G. Lanckriet
    Subjects: Machine Learning
    Abstract

    A Hilbert space embedding for probability measures has recently been
    proposed, wherein any probability measure is represented as a mean element in a
    reproducing kernel Hilbert space (RKHS). Such an embedding has found
    applications in homogeneity testing, independence testing, dimensionality
    reduction, etc., with the requirement that the reproducing kernel is
    characteristic, i.e., the embedding is injective.

  271. Comment on "Fastest learning in small-world neural networks".

    Authors: Z.X. Guo
    Subjects: Machine Learning
    Abstract

    This comment reexamines Simard et al.'s work in [D. Simard, L. Nadeau, H.
    Kroger, Phys. Lett. A 336 (2005) 8-15]. We found that Simard et al. calculated
    mistakenly the local connectivity lengths Dlocal of networks. The right results
    of Dlocal are presented and the supervised learning performance of feedforward
    neural networks (FNNs) with different rewirings are re-investigated in this
    comment.

  272. Security Analysis of Online Centroid Anomaly Detection.

    Authors: Marius Kloft, Pavel Laskov
    Subjects: Machine Learning
    Abstract

    Security issues are crucial in a number of machine learning applications,
    especially in scenarios dealing with human activity rather than natural
    phenomena (e.g., information ranking, spam detection, malware detection, etc.).
    It is to be expected in such cases that learning algorithms will have to deal
    with manipulated data aimed at hampering decision making. Although some
    previous work addressed the handling of malicious data in the context of
    supervised learning, very little is known about the behavior of anomaly
    detection methods in such scenarios.

  273. Principal Component Analysis with Contaminated Data: The High Dimensional Case.

    Authors: Shie Mannor, Huan Xu, Constantine Caramanis
    Subjects: Machine Learning
    Abstract

    We consider the dimensionality-reduction problem (finding a subspace
    approximation of observed data) for contaminated data in the high dimensional
    regime, where the number of observations is of the same magnitude as the number
    of variables of each observation, and the data set contains some (arbitrarily)
    corrupted observations. We propose a High-dimensional Robust Principal
    Component Analysis (HR-PCA) algorithm that is tractable, robust to contaminated
    points, and easily kernelizable.

  274. A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning.

    Authors: S.V.N. Vishwanathan, Jin Yu, Simon Guenter, Nicol N. Schraudolph
    Subjects: Machine Learning
    Abstract

    We extend the well-known BFGS quasi-Newton method and its memory-limited
    variant LBFGS to the optimization of nonsmooth convex objectives. This is done
    in a rigorous fashion by generalizing three components of BFGS to
    subdifferentials: the local quadratic model, the identification of a descent
    direction, and the Wolfe line search conditions. We prove that under some
    technical conditions, the resulting subBFGS algorithm is globally convergent in
    objective function value. We apply its memory-limited variant (subLBFGS) to
    L_2-regularized risk minimization with the binary hinge loss.

  275. Query Learning with Exponential Query Costs.

    Authors: Clayton Scott, Gowtham Bellala, Suresh Bhavnani
    Subjects: Machine Learning
    Abstract

    In query learning, the goal is to identify an unknown object while minimizing
    the number of "yes" or "no" questions (queries) posed about that object. A
    well-studied algorithm for query learning is known as generalized binary search
    (GBS). We show that GBS is a greedy algorithm to optimize the expected number
    of queries needed to identify the unknown object. We also generalize GBS in two
    ways. First, we consider the case where the cost of querying grows
    exponentially in the number of queries and the goal is to minimize the expected
    exponential cost.

  276. Robust Independent Component Analysis by Iterative Maximization of the Kurtosis Contrast with Algebraic Optimal Step Size.

    Authors: Pierre Comon, Vicente Zarzoso
    Subjects: Machine Learning
    Abstract

    Independent component analysis (ICA) aims at decomposing an observed random
    vector into statistically independent variables. Deflation-based
    implementations, such as the popular one-unit FastICA algorithm and its
    variants, extract the independent components one after another. A novel method
    for deflationary ICA, referred to as RobustICA, is put forward in this paper.
    This simple technique consists of performing exact line search optimization of
    the kurtosis contrast function.

  277. Plugin procedure in segmentation and application to hyperspectral image segmentation.

    Authors: R. Girard
    Subjects: Machine Learning
    Abstract

    In this article we give our contribution to the problem of segmentation with
    plug-in procedures. We give general sufficient conditions under which plug in
    procedure are efficient. We also give an algorithm that satisfy these
    conditions. We give an application of the used algorithm to hyperspectral
    images segmentation. Hyperspectral images are images that have both spatial and
    spectral coherence with thousands of spectral bands on each pixel. In the
    proposed procedure we combine a reduction dimension technique and a spatial
    regularisation technique.

  278. Ultrahigh dimensional variable selection for Cox's proportional hazards model.

    Authors: Jianqing Fan, Yichao Wu, Yang Feng
    Subjects: Machine Learning
    Abstract

    Variable selection in high dimensional space has challenged many contemporary
    statistical problems from many frontiers of scientific disciplines. Recent
    technology advance has made it possible to collect a huge amount of covariate
    information such as microarray, proteomic and SNP data via bioimaging
    technology while observing survival information on patients in clinical
    studies. Thus, the same challenge applies to the survival analysis in order to
    understand the association between genomics information and clinical
    information about the survival time.

  279. Online Learning for Matrix Factorization and Sparse Coding.

    Authors: Francis Bach, Julien Mairal, Jean Ponce, Guillermo Sapiro
    Subjects: Machine Learning
    Abstract

    Sparse coding--that is, modelling data vectors as sparse linear combinations
    of basis elements--is widely used in machine learning, neuroscience, signal
    processing, and statistics. This paper focuses on the large-scale matrix
    factorization problem that consists of learning the basis set, adapting it to
    specific data. Variations of this problem include dictionary learning in signal
    processing, non-negative matrix factorization and sparse principal component
    analysis.

  280. Operator norm convergence of spectral clustering on level sets.

    Authors: Bruno Pelletier, Pierre Pudlo
    Subjects: Machine Learning
    Abstract

    Following Hartigan, a cluster is defined as a connected component of the
    t-level set of the underlying density, i.e., the set of points for which the
    density is greater than t. A clustering algorithm which combines a density
    estimate with spectral clustering techniques is proposed. Our algorithm is
    composed of two steps. First, a nonparametric density estimate is used to
    extract the data points for which the estimated density takes a value greater
    than t. Next, the extracted points are clustered based on the eigenvectors of a
    graph Laplacian matrix.

  281. Probabilistic Recovery of Multiple Subspaces in Point Clouds by Geometric lp Minimization.

    Authors: Teng Zhang, Gilad Lerman
    Subjects: Machine Learning
    Abstract

    We assume data independently sampled from a mixture distribution on the unit
    ball of the D-dimensional Euclidean space with K+1 components: the first
    component is a uniform distribution on that ball representing outliers and the
    other K components are uniform distributions along K d-dimensional linear
    subspaces restricted to that ball.

  282. Manifold-Based Signal Recovery and Parameter Estimation from Compressive Measurements.

    Authors: Michael B. Wakin
    Subjects: Machine Learning
    Abstract

    A field known as Compressive Sensing (CS) has recently emerged to help
    address the growing challenges of capturing and processing high-dimensional
    signals and data sets. CS exploits the surprising fact that the information
    contained in a sparse signal can be preserved in a small number of compressive
    (or random) linear measurements of that signal. Strong theoretical guarantees
    have been established on the accuracy to which sparse or near-sparse signals
    can be recovered from noisy compressive measurements.

  283. K-Dimensional Coding Schemes in Hilbert Spaces.

    Authors: Andreas Maurer Massimiliano Pontil
    Subjects: Machine Learning
    Abstract

    This paper presents a general coding method where data in a Hilbert space are
    represented by finite dimensional coding vectors. The method is based on
    empirical risk minimization within a certain class of linear operators, which
    map the set of coding vectors to the Hilbert space. Two results bounding the
    expected reconstruction error of the method are derived, which highlight the
    role played by the codebook and the class of linear operators.

  284. Classifying the typefaces of the Gutenberg 42-line bible.

    Authors: Aureli Alabert, Luz Ma. Rangel
    Subjects: Machine Learning
    Abstract

    We have measured the dissimilarities among several printed characters of a
    single page in the Gutenberg 42-line bible and we prove statistically the
    existence of several different matrices from which the metal types where
    constructed. This is in contrast with the prevailing theory, which states that
    only one matrix per character was used in the printing process of Gutenberg's
    greatest work.

  285. Classifying Network Data with Deep Kernel Machines.

    Authors: Xiao Tang, Mu Zhu
    Subjects: Machine Learning
    Abstract

    Inspired by a growing interest in analyzing network data, we study the
    problem of node classification on graphs, focusing on approaches based on
    kernel machines. Conventionally, kernel machines are linear classifiers in the
    implicit feature space. We argue that linear classification in the feature
    space of kernels commonly used for graphs is often not enough to produce good
    results. When this is the case, one naturally considers nonlinear classifiers
    in the feature space.

  286. Bayesian Inference in Queueing Networks.

    Authors: Michael I. Jordan, Charles Sutton
    Subjects: Machine Learning
    Abstract

    Modern Web services, such as those at Google, Yahoo!, and Amazon, handle
    billions of requests per day on clusters of thousands of computers. Because
    these services operate under strict performance requirements, a statistical
    understanding of their performance is of great practical interest. Such
    services are modeled by networks of queues, where one queue models each of the
    individual computers in the system. A key challenge is that the data is
    incomplete, because recording detailed information about every request to a
    heavily used system can require unacceptable overhead.

  287. Increasing stability and interpretability of gene expression signatures.

    Authors: Anne-Claire Haury, Laurent Jacob, Jean-Philippe Vert
    Subjects: Machine Learning
    Abstract

    Motivation : Molecular signatures for diagnosis or prognosis estimated from
    large-scale gene expression data often lack robustness and stability, rendering
    their biological interpretation challenging. Increasing the signature's
    interpretability and stability across perturbations of a given dataset and, if
    possible, across datasets, is urgently needed to ease the discovery of
    important biological processes and, eventually, new drug targets. Results : We
    propose a new method to construct signatures with increased stability and
    easier interpretability.

  288. Sparsity-accuracy trade-off in MKL.

    Authors: Taiji Suzuki, Ryota Tomioka
    Subjects: Machine Learning
    Abstract

    We empirically investigate the best trade-off between sparse and
    uniformly-weighted multiple kernel learning (MKL) using the elastic-net
    regularization on real and simulated datasets. We find that the best trade-off
    parameter depends not only on the sparsity of the true kernel-weight spectrum
    but also on the linear dependence among kernels and the number of samples.

  289. Bayesian reduced-order models for multiscale dynamical systems.

    Authors: P.S. Koutsourelakis, Elias Bilionis
    Subjects: Machine Learning
    Abstract

    While existing mathematical descriptions can accurately account for phenomena
    at microscopic scales (e.g. molecular dynamics), these are often
    high-dimensional, stochastic and their applicability over macroscopic time
    scales of physical interest is computationally infeasible or impractical. In
    complex systems, with limited physical insight on the coherent behavior of
    their constituents, the only available information is data obtained from
    simulations of the trajectories of huge numbers of degrees of freedom over
    microscopic time scales.

  290. Tree Density Estimation.

    Authors: Larry Wasserman, Han Liu, John Lafferty
    Subjects: Machine Learning
    Abstract

    We study graph estimation and density estimation in high dimensions. To avoid
    the curse of dimensionality, we consider a family of density estimators based
    on tree structured undirected graphical models. We do not assume the true
    distribution corresponds to a tree; rather, we try to find the best tree-based
    approximation to the true distribution. We apply the Chow-Liu algorithm to
    kernel density estimates to build a tree and then use a data-splitting scheme
    to choose the number of edges. We also prove oracle properties on both function
    estimation and structure learning.

  291. Spectral clustering based on local linear approximations.

    Authors: Ery Arias-Castro, Gilad Lerman, Guangliang Chen
    Subjects: Machine Learning
    Abstract

    In the context of clustering, we assume a generative model where each cluster
    is the result of sampling points in the neighborhood of an embedded smooth
    surface, possibly contaminated with outliers. We consider a prototype for a
    higher-order spectral clustering method based on the residual from a local
    linear approximation.

  292. Learning the Structure of Deep Sparse Graphical Models.

    Authors: Ryan Prescott Adams, Zoubin Ghahramani, Hanna M. Wallach
    Subjects: Machine Learning
    Abstract

    Deep belief networks are a powerful way to model complex probability
    distributions. However, learning the structure of a belief network,
    particularly one with hidden units, is difficult. The Indian buffet process has
    been used as a nonparametric Bayesian prior on the directed structure of a
    belief network with a single infinitely wide hidden layer. In this paper, we
    introduce the cascading Indian buffet process (CIBP), which provides a
    nonparametric prior on the structure of a layered, directed belief network that
    is unbounded in both depth and width, yet allows tractable inference.

  293. Regularization for Matrix Completion.

    Authors: Raghunandan H. Keshavan, Andrea Montanari
    Subjects: Machine Learning
    Abstract

    We consider the problem of reconstructing a low rank matrix from noisy
    observations of a subset of its entries. This task has applications in
    statistical learning, computer vision, and signal processing. In these
    contexts, "noise" generically refers to any contribution to the data that is
    not captured by the low-rank model. In most applications, the noise level is
    large compared to the underlying signal and it is important to avoid
    overfitting. In order to tackle this problem, we define a regularized cost
    function well suited for spectral reconstruction methods.

  294. MedLDA: A General Framework of Maximum Margin Supervised Topic Models.

    Authors: Jun Zhu, Amr Ahmed, Eric P. Xing
    Subjects: Machine Learning
    Abstract

    Supervised topic models utilize document's side information for discovering
    predictive low dimensional representations of documents. Existing models apply
    the likelihood-based estimation. In this paper, we present a general framework
    of max-margin supervised topic models for both continuous and categorical
    response variables. Our approach, the maximum entropy discrimination latent
    Dirichlet allocation (MedLDA), utilizes the max-margin principle to train
    supervised topic models and estimate predictive topic representations that are
    arguably more suitable for prediction tasks.

  295. A Geometric Proof of Calibration.

    Authors: Shie Mannor, Gilles Stoltz
    Subjects: Machine Learning
    Abstract

    We provide yet another proof of the existence of calibrated forecasters; it
    has two merits. First, it is valid for an arbitrary finite number of outcomes.
    Second, it is short and simple and it follows from a direct application of
    Blackwell's approachability theorem to carefully chosen vector-valued payoff
    function and convex target set. Our proof captures the essence of existing
    proofs based on approachability (e.g., the proof by Foster, 1999 in case of
    binary outcomes) and highlights the intrinsic connection between
    approachability and calibration.

  296. Optimal construction of k-nearest neighbor graphs for identifying noisy clusters.

    Authors: Markus Maier, Matthias Hein, Ulrike von Luxburg
    Subjects: Machine Learning
    Abstract

    We study clustering algorithms based on neighborhood graphs on a random
    sample of data points. The question we ask is how such a graph should be
    constructed in order to obtain optimal clustering results. Which type of
    neighborhood graph should one choose, mutual k-nearest neighbor or symmetric
    k-nearest neighbor? What is the optimal parameter k? In our setting, clusters
    are defined as connected components of the t-level set of the underlying
    probability distribution.

  297. Composite Binary Losses.

    Authors: Mark D. Reid, Robert C. Williamson
    Subjects: Machine Learning
    Abstract

    We study losses for binary classification and class probability estimation
    and extend the understanding of them from margin losses to general composite
    losses which are the composition of a proper loss with a link function. We
    characterise when margin losses can be proper composite losses, explicitly show
    how to determine a symmetric loss in full from half of one of its partial
    losses, introduce an intrinsic parametrisation of composite binary losses and
    give a complete characterisation of the relationship between proper losses and
    ``classification calibrated'' losses.

  298. Variational Inducing Kernels for Sparse Convolved Multiple Output Gaussian Processes.

    Authors: Mauricio A. &#xc1;lvarez, Neil D. Lawrence, David Luengo, Michalis K. Titsias
    Subjects: Machine Learning
    Abstract

    Interest in multioutput kernel methods is increasing, whether under the guise
    of multitask learning, multisensor networks or structured output data. From the
    Gaussian process perspective a multioutput Mercer kernel is a covariance
    function over correlated output functions. One way of constructing such kernels
    is based on convolution processes (CP). A key problem for this approach is
    efficient inference. Alvarez and Lawrence (2009) recently presented a sparse
    approximation for CPs that enabled efficient inference.

  299. Multi-Way, Multi-View Learning.

    Authors: Ilkka Huopaniemi, Tommi Suvitaival, Janne Nikkil&#xe4;, Matej Ore&#x161;i&#x10d;, Samuel Kaski
    Subjects: Machine Learning
    Abstract

    We extend multi-way, multivariate ANOVA-type analysis to cases where one
    covariate is the view, with features of each view coming from different,
    high-dimensional domains. The different views are assumed to be connected by
    having paired samples; this is a common setup in recent bioinformatics
    experiments, of which we analyze metabolite profiles in different conditions
    (disease vs. control and treatment vs.

  300. Condition Number Analysis of Kernel-based Density Ratio Estimation.

    Authors: Taiji Suzuki, Masashi Sugiyama, Takafumi Kanamori
    Subjects: Machine Learning
    Abstract

    The ratio of two probability densities can be used for solving various
    machine learning tasks such as covariate shift adaptation (importance
    sampling), outlier detection (likelihood-ratio test), and feature selection
    (mutual information).

  301. Under-determined reverberant audio source separation using a full-rank spatial covariance model.

    Authors: Ngoc Duong, Emmanuel Vincent, Remi Gribonval
    Subjects: Machine Learning
    Abstract

    This article addresses the modeling of reverberant recording environments in
    the context of under-determined convolutive blind source separation. We model
    the contribution of each source to all mixture channels in the time-frequency
    domain as a zero-mean Gaussian random variable whose covariance encodes the
    spatial characteristics of the source. We then consider four specific
    covariance models, including a full-rank unconstrained model.

  302. Learning an Interactive Segmentation System.

    Authors: Hannes Nickisch, Pushmeet Kohli, Carsten Rother
    Subjects: Machine Learning
    Abstract

    Many successful applications of computer vision to image or video
    manipulation are interactive by nature. However, parameters of such systems are
    often trained neglecting the user. Traditionally, interactive systems have been
    treated in the same manner as their fully automatic counterparts. Their
    performance is evaluated by computing the accuracy of their solutions under
    some fixed set of user interactions. This paper proposes a new evaluation and
    learning method which brings the user in the loop. It is based on the use of an
    active robot user - a simulated model of a human user.

  303. Hyper-sparse optimal aggregation.

    Authors: St&#xe9;phane Ga&#xef;ffas, Guillaume Lecu&#xe9;
    Subjects: Machine Learning
    Abstract

    In this paper, we consider the problem of "hyper-sparse aggregation". Namely,
    given a dictionary $F = \{f_1, ..., f_M \}$ of functions, we look for an
    optimal aggregation algorithm that writes $\tilde f = \sum_{j=1}^M \theta_j
    f_j$ with as many zero coefficients $\theta_j$ as possible. This problem is of
    particular interest when $F$ contains many irrelevant functions that should not
    appear in $\tilde{f}$. We provide an exact oracle inequality for $\tilde f$,
    where only two coefficients are non-zero, that entails $\tilde f$ to be an
    optimal aggregation algorithm.

  304. On the numeric stability of the SFA implementation sfa-tk.

    Authors: Wolfgang Konen
    Subjects: Machine Learning
    Abstract

    Slow feature analysis (SFA) is a method for extracting slowly varying
    features from a quickly varying multidimensional signal. An open source
    Matlab-implementation sfa-tk makes SFA easily useable. We show here that under
    certain circumstances, namely when the covariance matrix of the nonlinearly
    expanded data does not have full rank, this implementation runs into numerical
    instabilities. We propse a modified algorithm based on singular value
    decomposition (SVD) which is free of those instabilities even in the case where
    the rank of the matrix is only less than 10% of its size.

  305. How to Explain Individual Classification Decisions.

    Authors: David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, Klaus-Robert Mueller
    Subjects: Machine Learning
    Abstract

    After building a classifier with modern tools of machine learning we
    typically have a black box at hand that is able to predict well for unseen
    data. Thus, we get an answer to the question what is the most likely label of a
    given unseen data point. However, most methods will provide no answer why the
    model predicted the particular label for a single instance and what features
    were most influential for that particular instance. The only method that is
    currently able to provide such explanations are decision trees.

  306. Qualitative Robustness of Support Vector Machines.

    Authors: Robert Hable, Andreas Christmann
    Subjects: Machine Learning
    Abstract

    Support vector machines have attracted much attention in theoretical and in
    applied statistics. Main topics of recent interest are consistency, learning
    rates and robustness. In this article, it is shown that support vector machines
    are qualitatively robust. Since support vector machines can be represented by a
    functional on the set of all probability measures, qualitative robustness is
    proven by showing that this functional is continuous with respect to the
    topology generated by weak convergence of probability measures.

  307. Thresholding-based Iterative Selection Procedures for Generalized Linear Models.

    Authors: Yiyuan She
    Subjects: Machine Learning
    Abstract

    High-dimensional correlated data pose challenges in model selection and
    predictive learning. Recent work by She (2009) was among the first to advocate
    the use of nonconvex methods to meet this challenge. In this paper, we
    generalize the thresholding-based iterative selection procedures (TISPs) to
    generalized linear models (GLMs). Somewhat different than other works, the
    launching point here is thresholding rules rather than penalty functions. We
    successfully establish a universal connection between TISPs and penalized
    likelihoods.

  308. Positive Definite Kernels in Machine Learning.

    Authors: Marco Cuturi
    Subjects: Machine Learning
    Abstract

    This survey is an introduction to positive definite kernels and the set of
    methods they have inspired in the machine learning literature, namely kernel
    methods. We first discuss some properties of positive definite kernels as well
    as reproducing kernel Hibert spaces, the natural extension of the set of
    functions $\{k(x,\cdot),x\in\mathcal{X}\}$ associated with a kernel $k$ defined
    on a space $\mathcal{X}$. We discuss at length the construction of kernel
    functions that take advantage of well-known statistical models.

  309. Penalized Likelihood Methods for Estimation of Sparse High Dimensional Directed Acyclic Graphs.

    Authors: Ali Shojaie, George Michailidis
    Subjects: Machine Learning
    Abstract

    Directed acyclic graphs (DAGs) are commonly used to represent causal
    relationships among random variables in graphical models. Applications of these
    models arise in the study of physical, as well as biological systems, where
    directed edges between nodes represent the influence of components of the
    system on each other. The general problem of estimating DAGs from observed data
    is computationally NP-hard, Moreover two directed graphs may be observationally
    equivalent.

  310. Sparse Empirical Bayes Analysis (SEBA).

    Authors: Ya&#x27;Acov Ritov, Natalia Bochkina
    Subjects: Machine Learning
    Abstract

    We consider a joint processing of $n$ independent sparse regression problems.
    Each is based on a sample $(y_{i1},x_{i1})...,(y_{im},x_{im})$ of $m$ \iid
    observations from $y_{i1}=x_{i1}\t\beta_i+\eps_{i1}$, $y_{i1}\in \R$, $x_{i
    1}\in\R^p$, $i=1,...,n$, and $\eps_{i1}\dist N(0,\sig^2)$, say. $p$ is large
    enough so that the empirical risk minimizer is not consistent. We consider
    three possible extensions of the lasso estimator to deal with this problem, the
    lassoes, the group lasso and the RING lasso, each utilizing a different
    assumption how these problems are related.

  311. Sparse Convolved Multiple Output Gaussian Processes.

    Authors: Mauricio A. &#xc1;lvarez, Neil D. Lawrence
    Subjects: Machine Learning
    Abstract

    Recently there has been an increasing interest in methods that deal with
    multiple outputs. This has been motivated partly by frameworks like multitask
    learning, multisensor networks or structured output data. From a Gaussian
    processes perspective, the problem reduces to specifying an appropriate
    covariance function that, whilst being positive semi-definite, captures the
    dependencies between all the data points and across all the outputs. One
    approach to account for non-trivial correlations between outputs employs
    convolution processes.

  312. Group-based Query Learning for rapid diagnosis in time-critical situations.

    Authors: Clayton Scott, Gowtham Bellala, Suresh Bhavnani
    Subjects: Machine Learning
    Abstract

    In query learning, the goal is to identify an unknown object while minimizing
    the number of "yes or no" questions (queries) posed about that object. We
    consider three extensions of this fundamental problem that are motivated by
    practical considerations in real-world, time-critical identification tasks such
    as emergency response. First, we consider the problem where the objects are
    partitioned into groups, and the goal is to identify only the group to which
    the object belongs.

  313. How slow is slow? SFA detects signals that are slower than the driving force.

    Authors: Wolfgang Konen, Patrick Koch
    Subjects: Machine Learning
    Abstract

    Slow feature analysis (SFA) is a method for extracting slowly varying driving
    forces from quickly varying nonstationary time series. We show here that it is
    possible for SFA to detect a component which is even slower than the driving
    force itself (e.g. the envelope of a modulated sine wave). It is shown that it
    depends on circumstances like the embedding dimension, the time series
    predictability, or the base frequency, whether the driving force itself or a
    slower subcomponent is detected.

  314. Dimension reduction and variable selection in case control studies via regularized likelihood optimization.

    Authors: Florentina Bunea, Adrian Barbu
    Subjects: Machine Learning
    Abstract

    Dimension reduction and variable selection are performed routinely in
    case-control studies, but the literature on the theoretical aspects of the
    resulting estimates is scarce. We bring our contribution to this literature by
    studying estimators obtained via L1 penalized likelihood optimization. We show
    that the optimizers of the L1 penalized retrospective likelihood coincide with
    the optimizers of the L1 penalized prospective likelihood.

  315. Data spectroscopy: Eigenspaces of convolution operators and clustering.

    Authors: Bin Yu, Mikhail Belkin, Tao Shi
    Subjects: Machine Learning
    Abstract

    This paper focuses on obtaining clustering information about a distribution
    from its i.i.d. samples. We develop theoretical results to understand and use
    clustering information contained in the eigenvectors of data adjacency matrices
    based on a radial kernel function with a sufficiently fast tail decay. In
    particular, we provide population analyses to gain insights into which
    eigenvectors should be used and when the clustering information for the
    distribution can be recovered from the sample.

  316. Likelihood-based semi-supervised model selection with applications to speech processing.

    Authors: Patrick J. Wolfe, Christopher M. White, Sanjeev P. Khudanpur
    Subjects: Machine Learning
    Abstract

    In conventional supervised pattern recognition tasks, model selection is
    typically accomplished by minimizing the classification error rate on a set of
    so-called development data, subject to ground-truth labeling by human experts
    or some other means. In the context of speech processing systems and other
    large-scale practical applications, however, such labeled development data are
    typically costly and difficult to obtain.

  317. Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparse Learning.

    Authors: Taiji Suzuki, Ryota Tomioka, Masashi Sugiyama
    Subjects: Machine Learning
    Abstract

    We analyze the convergence behaviour of a recently proposed algorithm for
    sparse learning called Dual Augmented Lagrangian (DAL). We theoretically
    analyze under some conditions that DAL converges super-linearly in a
    non-asymptotic and global sense. We experimentally confirm our analysis in a
    large scale $\ell_1$-regularized logistic regression problem and compare the
    efficiency of DAL algorithm to existing algorithms.

  318. High-dimensional additive modeling.

    Authors: Peter B&#xfc;hlmann, Lukas Meier, Sara van de Geer
    Subjects: Machine Learning
    Abstract

    We propose a new sparsity-smoothness penalty for high-dimensional generalized
    additive models. The combination of sparsity and smoothness is crucial for
    mathematical theory as well as performance for finite-sample data. We present a
    computationally efficient algorithm, with provable numerical convergence
    properties, for optimizing the penalized likelihood. Furthermore, we provide
    oracle results which yield asymptotic optimality of our estimator for high
    dimensional but sparse additive models.

  319. Maximum likelihood aggregation and misspecified generalized linear models.

    Authors: Philippe Rigollet
    Subjects: Machine Learning
    Abstract

    We a study a natural extension of the pure aggregation problem to handle more
    general distributions for the response in a regression setup with random or
    deterministic design. While this extension bears strong connections with
    generalized linear models, it does not require identifiability of the parameter
    of even that the model is true. It is shown that this problem can still be
    solved by likelihood maximization and we derive sharp oracle inequalities that
    hold both in expectation and with high probability.

  320. Causal Inference on Discrete Data using Additive Noise Models.

    Authors: Bernhard Sch&#xf6;lkopf, Dominik Janzing, Jonas Peters
    Subjects: Machine Learning
    Abstract

    Inferring the causal structure of a set of random variables from a finite
    sample of the joint distribution is an important problem in science. Recently,
    methods using additive noise models have been suggested to approach the case of
    continuous variables. In many situations, however, the variables of interest
    are discrete or even have only finitely many states. In this work we extend the
    notion of additive noise models to these cases.

  321. Which graphical models are difficult to learn?.

    Authors: Andrea Montanari, Jose Bento
    Subjects: Machine Learning
    Abstract

    We consider the problem of learning the structure of Ising models (pairwise
    binary Markov random fields) from i.i.d. samples. While several methods have
    been proposed to accomplish this task, their relative merits and limitations
    remain somewhat obscure. By analyzing a number of concrete examples, we show
    that low-complexity algorithms systematically fail when the Markov random field
    develops long-range correlations. More precisely, this phenomenon appears to be
    related to the Ising model phase transition (although it does not coincide with
    it).

  322. Convex Multiview Fisher Discriminant Analysis.

    Authors: John Shawe-Taylor, Tom Diethe
    Subjects: Machine Learning
    Abstract

    Section 1.3 was incorrect, and 2.1 will be removed from further submissions.
    A rewritten version will be posted in the future.

  323. Distinguishing Cause and Effect via Second Order Exponential Models.

    Authors: Dominik Janzing, Bernhard Schoelkopf, Xiaohai Sun
    Subjects: Machine Learning
    Abstract

    We propose a method to infer causal structures containing both discrete and
    continuous variables. The idea is to select causal hypotheses for which the
    conditional density of every variable, given its causes, becomes smooth. We
    define a family of smooth densities and conditional densities by second order
    exponential models, i.e., by maximizing conditional entropy subject to first
    and second statistical moments. If some of the variables take only values in
    proper subsets of R^n, these conditionals can induce different families of
    joint distributions even for Markov-equivalent graphs.

  324. On approximation of smoothing probabilities for hidden Markov models.

    Authors: J. Lember
    Subjects: Machine Learning
    Abstract

    We consider the smoothing probabilities of hidden Markov model (HMM). We show
    that under fairly general conditions for HMM, the exponential forgetting still
    holds, and the smoothing probabilities can be well approximated with the ones
    of double sided HMM. This makes it possible to use ergodic theorems. As an
    applications we consider the pointwise maximum a posteriori segmentation, and
    show that the corresponding risks converge.

  325. The Geometry of Generalized Binary Search.

    Authors: Robert D. Nowak
    Subjects: Machine Learning
    Abstract

    This paper investigates the problem of determining a binary-valued function
    through a sequence of strategically selected queries. The focus is an algorithm
    called Generalized Binary Search (GBS), a well-known greedy approach to this
    problem. At each step, a query is selected that most evenly splits the
    hypotheses under consideration into two disjoint subsets, a natural
    generalization of the idea underlying classic binary search and Shannon-Fano
    coding.

  326. The Geometry of Generalized Binary Search.

    Authors: Robert D. Nowak
    Subjects: Machine Learning
    Abstract

    This paper investigates the problem of determining a binary-valued function
    through a sequence of strategically selected queries. The focus is an algorithm
    called Generalized Binary Search (GBS), a well-known greedy approach to this
    problem. At each step, a query is selected that most evenly splits the
    hypotheses under consideration into two disjoint subsets, a natural
    generalization of the idea underlying classic binary search and Shannon-Fano
    coding.

  327. A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem.

    Authors: Bharath Sriperumbudur, David Torres, Gert Lanckriet
    Subjects: Machine Learning
    Abstract

    In this paper, we consider the sparse eigenvalue problem wherein the goal is
    to obtain a sparse solution to the generalized eigenvalue problem. We achieve
    this by constraining the cardinality of the solution to the generalized
    eigenvalue problem and obtain sparse principal component analysis (PCA), sparse
    canonical correlation analysis (CCA) and sparse Fisher discriminant analysis
    (FDA) as special cases.

  328. A Stochastic Model for Collaborative Recommendation.

    Authors: G&#xe9;rard Biau, Benoit Cadre, Laurent Rouvi&#xe8;re
    Subjects: Machine Learning
    Abstract

    Collaborative recommendation is an information-filtering technique that
    attempts to present information items (movies, music, books, news, images, Web
    pages, etc.) that are likely of interest to the Internet user. Traditionally,
    collaborative systems deal with situations with two types of variables, users
    and items. In its most common form, the problem is framed as trying to estimate
    ratings for items that have not yet been consumed by a user. Despite
    wide-ranging literature, little is known about the statistical properties of
    recommendation systems.

  329. A Stochastic Model for Collaborative Recommendation.

    Authors: G&#xe9;rard Biau, Benoit Cadre, Laurent Rouvi&#xe8;re
    Subjects: Machine Learning
    Abstract

    Collaborative recommendation is an information-filtering technique that
    attempts to present information items (movies, music, books, news, images, Web
    pages, etc.) that are likely of interest to the Internet user. Traditionally,
    collaborative systems deal with situations with two types of variables, users
    and items. In its most common form, the problem is framed as trying to estimate
    ratings for items that have not yet been consumed by a user. Despite
    wide-ranging literature, little is known about the statistical properties of
    recommendation systems.

  330. Node harvest: simple and interpretable regression and classification.

    Authors: Nicolai Meinshausen
    Subjects: Machine Learning
    Abstract

    When choosing a suitable technique for regression and classification with
    multivariate predictor variables, one is often faced with a tradeoff between
    interpretability and high predictive accuracy. To give a classical example,
    classification and regression trees are easy to understand and interpret. Tree
    ensembles like Random Forests provide, on the other hand, usually more accurate
    predictions.

  331. Mean-Field Theory of Meta-Learning.

    Authors: Dariusz Plewczynski
    Subjects: Machine Learning
    Abstract

    We discuss here the mean-field theory for a cellular automata model of
    meta-learning. The meta-learning is the process of combining outcomes of
    individual learning procedures in order to determine the final decision with
    higher accuracy than any single learning method. Our method is constructed from
    an ensemble of interacting, learning agents, that acquire and process incoming
    information using various types, or different versions of machine learning
    algorithms.

  332. Mean-Field Theory of Meta-Learning.

    Authors: Dariusz Plewczynski
    Subjects: Machine Learning
    Abstract

    We discuss here the mean-field theory for a cellular automata model of
    meta-learning. The meta-learning is the process of combining outcomes of
    individual learning procedures in order to determine the final decision with
    higher accuracy than any single learning method. Our method is constructed from
    an ensemble of interacting, learning agents, that acquire and process incoming
    information using various types, or different versions of machine learning
    algorithms.

  333. Distance Dependent Chinese Restaurant Processes.

    Authors: David M. Blei, Peter I. Frazier
    Subjects: Machine Learning
    Abstract

    We develop the distance dependent Chinese restaurant process (CRP), a
    flexible class of distributions over partitions that allows for
    non-exchangeability. This class can be used to model many kinds of dependencies
    between data in infinite clustering models, including dependencies across time
    or space. We examine the properties of the distance dependent CRP, discuss its
    connections to Bayesian nonparametric mixture models, and derive a Gibbs
    sampler for both observed and mixture settings. We study its performance with
    three text corpora.

  334. Functional learning through kernels.

    Authors: Stephane Canu, Xavier Mary, Alain Rakotomamonjy
    Subjects: Machine Learning
    Abstract

    This paper reviews the functional aspects of statistical learning theory. The
    main point under consideration is the nature of the hypothesis set when no
    prior information is available but data. Within this framework we first discuss
    about the hypothesis set: it is a vectorial space, it is a set of pointwise
    defined functions, and the evaluation functional on this set is a continuous
    mapping. Based on these principles an original theory is developed generalizing
    the notion of reproduction kernel Hilbert space to non hilbertian sets.

  335. BRAINSTORMING: Consensus Learning in Practice.

    Authors: Dariusz Plewczynski
    Subjects: Machine Learning
    Abstract

    We present here an introduction to Brainstorming approach, that was recently
    proposed as a consensus meta-learning technique, and used in several practical
    applications in bioinformatics and chemoinformatics. The consensus learning
    denotes heterogeneous theoretical classification method, where one trains an
    ensemble of machine learning algorithms using different types of input training
    data representations. In the second step all solutions are gathered and the
    consensus is build between them.

  336. SparseCodePicking: feature extraction in mass spectrometry using sparse coding algorithms.

    Authors: Theodore Alexandrov, Klaus Steinhorst, Oliver Keszoecze, Stefan Schiffler
    Subjects: Machine Learning
    Abstract

    Mass spectrometry (MS) is an important technique for chemical profiling which
    calculates for a sample a high dimensional histogram-like spectrum. A crucial
    step of MS data processing is the peak picking which selects peaks containing
    information about molecules with high concentrations which are of interest in
    an MS investigation. We present a new procedure of the peak picking based on a
    sparse coding algorithm. Given a set of spectra of different classes, i.e. with
    different positions and heights of the peaks, this procedure can extract peaks
    by means of unsupervised learning.

  337. Statistical Decision Making for Authentication and Intrusion Detection.

    Authors: Christos Dimitrakakis, Aikaterini Mitrokotsa
    Subjects: Machine Learning
    Abstract

    User authentication and intrusion detection differ from standard
    classification problems in that while we have data generated from legitimate
    users, impostor or intrusion data is scarce or non-existent. We review existing
    techniques for dealing with this problem and propose a novel alternative based
    on a principled statistical decision-making view point. We examine the
    technique on a toy problem and validate it on complex real-world data from an
    RFID based access control system. The results indicate that it can
    significantly outperform the classical world model approach.

  338. Expectation Propagation on the Maximum of Correlated Normal Variables.

    Authors: Philipp Hennig
    Subjects: Machine Learning
    Abstract

    Many inference problems involving questions of optimality ask for the maximum
    or the minimum of a finite set of unknown quantities. This technical report
    derives the first two posterior moments of the maximum of two correlated
    Gaussian variables and the first two posterior moments of the two generating
    variables (corresponding to Gaussian approximations minimizing relative
    entropy).

  339. Laplacian Support Vector Machines Trained in the Primal.

    Authors: Stefano Melacci, Mikhail Belkin
    Subjects: Machine Learning
    Abstract

    In the last few years, due to the growing ubiquity of unlabeled data, much
    effort has been spent by the machine learning community to develop better
    understanding and improve the quality of classifiers exploiting unlabeled data.
    Following the manifold regularization approach, Laplacian Support Vector
    Machines (LapSVMs) have shown the state of the art performance in
    semi--supervised classification. In this paper we present two strategies to
    solve the primal LapSVM problem, in order to overcome some issues of the
    original dual formulation.

  340. Dirichlet Process Mixtures of Generalized Linear Models.

    Authors: David M. Blei, Lauren A. Hannah, Warren B. Powell
    Subjects: Machine Learning
    Abstract

    We propose Dirichlet Process-Generalized Linear Models (DP-GLM), a new method
    of nonparametric regression that accommodates continuous and categorical
    inputs, and any response that can be modeled by a generalized linear model. We
    prove conditions for the asymptotic unbiasedness of the DP-GLM regression mean
    function estimate and give a practical example for when those conditions hold.
    Additionally, we provide Bayesian bounds on the distance of the estimate from
    the true mean function based on the number of observations and posterior
    samples.

  341. Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures.

    Authors: Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
    Subjects: Machine Learning
    Abstract

    The problem of learning tree-structured Gaussian graphical models from i.i.d.
    samples is considered. The influence of the tree structure and the parameters
    of the Gaussian distribution on the learning rate as the number of samples
    increases is discussed. Specifically, the error exponent corresponding to the
    event that the estimated tree structure differs from the actual unknown tree
    structure of the distribution is analyzed. Finding the error exponent reduces
    to a least-squares problem in the very noisy learning regime.

  342. SpicyMKL.

    Authors: Taiji Suzuki, Ryota Tomioka
    Subjects: Machine Learning
    Abstract

    We propose a new optimization algorithm for Multiple Kernel Learning (MKL)
    with general convex loss functions. The proposed algorithm is a proximal
    minimization method that utilizes the "smoothed" dual objective function and
    converges super-linearly. The sparsity of the intermediate solution plays a
    crucial role for the efficiency of the proposed algorithm. Consequently our
    algorithm scales well with increasing number of kernels. Experimental results
    show that our algorithm is favorable against existing methods especially when
    the number of kernels is large (> 1000).

  343. Telling cause from effect based on high-dimensional observations.

    Authors: Dominik Janzing, Patrik O. Hoyer, Bernhard Schoelkopf
    Subjects: Machine Learning
    Abstract

    We describe a method for inferring linear causal relations among
    multi-dimensional variables. The idea is to use an asymmetry between the
    distributions of cause and effect that occurs if both the covariance matrix of
    the cause and the structure matrix mapping cause to the effect are
    independently chosen. The method works for both stochastic and deterministic
    causal relations, provided that the dimensionality is sufficiently high (in
    some experiments, 5 was enough). It is applicable to Gaussian as well as
    non-Gaussian data.

  344. Initialization Free Graph Based Clustering.

    Authors: Laurent Galluccio, Olivier J.J. Michel, Pierre Comon, Eric Slezak, Alfred O. Hero
    Subjects: Machine Learning
    Abstract

    This paper proposes an original approach to cluster multi-component data
    sets, including an estimation of the number of clusters. From the construction
    of a minimal spanning tree with Prim's algorithm, and the assumption that the
    vertices are approximately distributed according to a Poisson distribution, the
    number of clusters is estimated by thresholding the Prim's trajectory. The
    corresponding cluster centroids are then computed in order to initialize the
    generalized Lloyd's algorithm, also known as $K$-means, which allows to
    circumvent initialization problems.

  345. Rumors in a Network: Who's the Culprit?.

    Authors: Devavrat Shah, Tauhid Zaman
    Subjects: Machine Learning
    Abstract

    Motivated by applications such as the detection of sources of worms or
    viruses in computer networks, identification of the origin of infectious
    diseases, determining the causes of cascading failures in systems such as
    financial markets, or inferring the leader in a social network, we study the
    question of inferring the source of a {\em rumor} in a network based on the
    information about {\em rumor infected} nodes and the underlying network
    structure. We start by proposing a natural, effective model for the spread of
    the rumor in a network based on the classical SIR model.

  346. Trek separation for Gaussian graphical models.

    Authors: Seth Sullivant, Kelli Talaska, Jan Draisma
    Subjects: Machine Learning
    Abstract

    Gaussian graphical models are semi-algebraic subsets of the cone of positive
    definite covariance matrices. Submatrices with low rank correspond to
    generalizations of conditional independence constraints on collections of
    random variables. We give a precise graph-theoretic characterization of when
    submatrices of the covariance matrix have small rank for a general class of
    mixed graphs that includes directed acyclic and undirected graphs as special
    cases. Our new trek separation criterion generalizes the familiar d-separation
    criterion.

  347. Trek separation for Gaussian graphical models.

    Authors: Seth Sullivant, Kelli Talaska, Jan Draisma
    Subjects: Machine Learning
    Abstract

    Gaussian graphical models are semi-algebraic subsets of the cone of positive
    definite covariance matrices. Submatrices with low rank correspond to
    generalizations of conditional independence constraints on collections of
    random variables. We give a precise graph-theoretic characterization of when
    submatrices of the covariance matrix have small rank for a general class of
    mixed graphs that includes directed acyclic and undirected graphs as special
    cases. Our new trek separation criterion generalizes the familiar d-separation
    criterion.

  348. Structured Variable Selection with Sparsity-Inducing Norms.

    Authors: Francis Bach, Jean-Yves Audibert, Rodolphe Jenatton
    Subjects: Machine Learning
    Abstract

    We consider the empirical risk minimization problem for linear supervised
    learning, with regularization by structured sparsity-inducing norms. These are
    defined as sums of Euclidean norms on certain subsets of variables, extending
    the usual $\ell_1$-norm and the group $\ell_1$-norm by allowing the subsets to
    overlap. This leads to a specific set of allowed nonzero patterns for the
    solutions of such problems.

  349. Computing p-values of LiNGAM outputs via Multiscale Bootstrap.

    Authors: Yusuke Komatsu, Shohei Shimizu, Hidetoshi Shimodaira
    Subjects: Machine Learning
    Abstract

    Structural equation models and Bayesian networks have been widely used to
    study causal relationships between continuous variables. Recently, a
    non-Gaussian method called LiNGAM was proposed to discover such causal models
    and has been extended in various directions. An important problem with LiNGAM
    is that the results are affected by the random sampling of the data as with any
    statistical method. Thus, some analysis of the statistical reliability or
    confidence level should be conducted. A common method to evaluate a confidence
    level is a bootstrap method.

  350. A Nonconformity Approach to Model Selection for SVMs.

    Authors: David R. Hardoon, Zakria Hussain, John Shawe-Taylor
    Subjects: Machine Learning
    Abstract

    We investigate the issue of model selection and the use of the nonconformity
    (strangeness) measure in batch learning. Using the nonconformity measure we
    propose a new training algorithm that helps avoid the need for Cross-Validation
    or Leave-One-Out model selection strategies. We provide a new generalisation
    error bound using the notion of nonconformity to upper bound the loss of each
    test example and show that our proposed approach is comparable to standard
    model selection methods, but with theoretical guarantees of success and faster
    convergence.

  351. Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions.

    Authors: Ery Arias-Castro
    Subjects: Machine Learning
    Abstract

    In the context of clustering, we consider a generative model in a Euclidean
    ambient space with clusters of different shapes, dimensions, sizes and
    densities. In an asymptotic setting where the number of points becomes large,
    we obtain theoretical guaranties for a few emblematic methods based on pairwise
    distances: a simple algorithm based on the extraction of connected components
    in a neighborhood graph; the spectral clustering method of Ng, Jordan and
    Weiss; and hierarchical clustering with single linkage.

  352. Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity.

    Authors: Seyoung Kim
    Subjects: Machine Learning
    Abstract

    We consider the problem of learning a sparse multi-task regression with an
    application to a genetic association mapping problem for discovering genetic
    markers that influence expression levels of multiple genes jointly. In
    particular, we consider the case where the structure over the outputs can be
    represented as a tree with leaf nodes as outputs and internal nodes as clusters
    of the outputs at multiple granularity, and aim to recover the common set of
    relevant inputs for each output cluster.

  353. On Ranking Senators By Their Votes.

    Authors: Mugizi Rwebangira
    Subjects: Machine Learning
    Abstract

    The problem of ranking a set of objects given some measure of similarity is
    one of the most basic in machine learning. Recently Agarwal proposed a method
    based on techniques in semi-supervised learning utilizing the graph Laplacian.
    In this work we consider a novel application of this technique to ranking
    binary choice data and apply it specifically to ranking US Senators by their
    ideology.

  354. Structured Sparse Principal Component Analysis.

    Authors: Francis Bach, Rodolphe Jenatton, Guillaume Obozinski
    Subjects: Machine Learning
    Abstract

    We present an extension of sparse PCA, or sparse dictionary learning, where
    the sparsity patterns of all dictionary elements are structured and constrained
    to belong to a prespecified set of shapes. This \emph{structured sparse PCA} is
    based on a structured regularization recently introduced by [1]. While
    classical sparse priors only deal with \textit{cardinality}, the regularization
    we use encodes higher-order information about the data. We propose an efficient
    and simple optimization procedure to solve this problem.

  355. High-dimensional Graphical Model Search with gRapHD R Package.

    Authors: Gabriel C. G. de Abreu, Rodrigo Labouriau, David Edwards
    Subjects: Machine Learning
    Abstract

    This paper presents the R package gRapHD for efficient selection of
    high-dimensional undirected graphical models. The package provides tools for
    selecting trees, forests and decomposable models minimizing information
    criteria such as AIC or BIC, and for displaying the independence graphs of the
    models. It has also some useful tools for analysing graphical structures. It
    supports the use of discrete, continuous, or both types of variables.

  356. Kernels for Measures Defined on the Gram Matrix of their Support.

    Authors: Marco Cuturi
    Subjects: Machine Learning
    Abstract

    We present in this work a new family of kernels to compare positive measures
    on arbitrary spaces $\Xcal$ endowed with a positive kernel $\kappa$, which
    translates naturally into kernels between histograms or clouds of points. We
    first cover the case where $\Xcal$ is Euclidian, and focus on kernels which
    take into account the variance matrix of the mixture of two measures to compute
    their similarity.

  357. Geometry of the restricted Boltzmann machine.

    Authors: Bernd Sturmfels, Maria Angelica Cueto, Jason Morton
    Subjects: Machine Learning
    Abstract

    The restricted Boltzmann machine is a graphical model for binary random
    variables. Based on a complete bipartite graph separating hidden and observed
    variables, it is the binary analog to the factor analysis model. We study this
    graphical model from the perspectives of algebraic statistics and tropical
    geometry, starting with the observation that its Zariski closure is a Hadamard
    power of the first secant variety of the Segre variety of projective lines.

  358. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies.

    Authors: Michael I. Jordan, David M. Blei, Thomas L. Griffiths
    Subjects: Machine Learning
    Abstract

    We present the nested Chinese restaurant process (nCRP), a stochastic process
    which assigns probability distributions to infinitely-deep,
    infinitely-branching trees. We show how this stochastic process can be used as
    a prior distribution in a Bayesian nonparametric model of document collections.
    Specifically, we present an application to information retrieval in which
    documents are modeled as paths down a random tree, and the preferential
    attachment dynamics of the nCRP leads to clustering of documents according to
    sharing of topics at multiple levels of abstraction.

  359. Learning Bayesian Networks with the bnlearn Package.

    Authors: Marco Scutari
    Subjects: Machine Learning
    Abstract

    bnlearn is an R package which includes several algorithms for learning the
    structure of Bayesian networks with either discrete or continuous variables.
    Both constraint-based and score-based algorithms are implemented, and can use
    the functionality provided by the snow package to improve their performance via
    parallel computing. Several network scores and conditional independence
    algorithms are available for both the learning algorithms and independent use.
    Advanced plotting options are provided by the Rgraphviz package.

  360. The Optimal Unbiased Value Estimator and its Relation to LSTD, TD and MC.

    Authors: Steffen Gr&#xfc;new&#xe4;lder, Klaus Obermayer
    Subjects: Machine Learning
    Abstract

    In this analytical study we derive the optimal unbiased value estimator (MVU)
    and compare its statistical risk to three well known value estimators: Temporal
    Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares
    Temporal Difference Learning (LSTD). We demonstrate that LSTD is equivalent to
    the MVU if the Markov Reward Process (MRP) is acyclic and show that both differ
    for most cyclic MRPs as LSTD is then typically biased. More generally, we show
    that estimators that fulfill the Bellman equation can only be unbiased for
    special cyclic MRPs.

RSS-материал