Sparse estimation methods are aimed at using or obtaining parsimonious
representations of data or models. They were first dedicated to linear variable
selection but numerous extensions have now emerged such as structured sparsity
or kernel selection. It turns out that many of the related estimation problems
can be cast as convex optimization problems by regularizing the empirical risk
with appropriate non-smooth norms. The goal of this paper is to present from a
general perspective optimization tools and techniques dedicated to such
sparsity-inducing penalties.
We consider the problem of optimizing the sum of a smooth convex function and
a non-smooth convex function using proximal-gradient methods, where an error is
present in the calculation of the gradient of the smooth term or in the
proximity operator with respect to the non-smooth term.
Sparse estimation methods are aimed at using or obtaining parsimonious
representations of data or models. While naturally cast as a combinatorial
optimization problem, variable or feature selection admits a convex relaxation
through the regularization by the $\ell_1$-norm. In this paper, we consider
situations where we are not only interested in sparsity, but where some
structural prior knowledge is available as well.
In this paper we study the kernel multiple ridge regression framework, which
we refer to as multi-task regression, using penalization techniques. The
theoretical analysis of this problem shows that the key element appearing for
an optimal calibration is the covariance matrix of the noise between the
different tasks. We present a new algorithm to estimate this covariance matrix,
based on the concept of minimal penalty, which was previously used in the
single-task regression framework to estimate the variance of the noise.
Nonnegative matrix factorization (NMF) is now a common tool for audio source
separation. When learning NMF on large audio databases, one major drawback is
that the complexity in time is O(FKN) when updating the dictionary (where (F;N)
is the dimension of the input power spectrograms, and K the number of basis
spectra), thus forbidding its application on signals longer than an hour. We
provide an online algorithm with a complexity of O(FK) in time and memory for
updates in the dictionary.
Inverse inference, or "brain reading", is a recent paradigm for analyzing
functional magnetic resonance imaging (fMRI) data, based on pattern recognition
and statistical learning. By predicting some cognitive variables related to
brain activation maps, this approach aims at decoding brain activity. Inverse
inference takes into account the multivariate information between voxels and is
currently the only way to assess how precisely some cognitive information is
encoded by the activity of neural populations within the whole brain.
We consider a class of learning problems regularized by a structured
sparsity-inducing norm defined as the sum of l_2 or l_infinity norms over
groups of variables. Whereas much effort has been put in developing fast
optimization techniques when the groups are disjoint or embedded in a
hierarchy, we address here the case of general overlapping groups.
Modeling data with linear combinations of a few elements from a learned
dictionary has been the focus of much recent research in machine learning,
neuroscience and signal processing. For signals such as natural images that
admit such sparse representations, it is now well established that these models
are well suited to restoration tasks. In this context, learning the dictionary
amounts to solving a large-scale matrix factorization problem, which can be
done efficiently with classical optimization tools.
Sparse methods for supervised learning aim at finding good linear predictors
from as few variables as possible, i.e., with small cardinality of their
supports. This combinatorial selection problem is often turned into a convex
optimization problem by replacing the cardinality function by its convex
envelope (tightest convex lower bound), in this case the L1-norm.
Sparse coding consists in representing signals as sparse linear combinations
of atoms selected from a dictionary. We consider an extension of this framework
where the atoms are further assumed to be embedded in a tree. This is achieved
using a recently introduced tree-structured sparse regularization norm, which
has proven useful in several applications. This norm leads to regularized
problems that are difficult to optimize, and we propose in this paper efficient
algorithms for solving them.
We consider a class of learning problems that involve a structured
sparsity-inducing norm defined as the sum of $\ell_\infty$-norms over groups of
variables. Whereas a lot of effort has been put in developing fast optimization
methods when the groups are disjoint or embedded in a specific hierarchical
structure, we address here the case of general overlapping groups. To this end,
we show that the corresponding optimization problem is related to network flow
optimization. More precisely, the proximal problem associated with the norm we
consider is dual to a quadratic min-cost flow problem.
We use convex relaxation techniques to produce lower bounds on the optimal
value of subset selection problems and generate good approximate solutions. We
then explicitly bound the quality of these relaxations by studying the
approximation ratio of sparse eigenvalue relaxations. Our results are used to
improve the performance of branch-and-bound algorithms to produce exact
solutions to subset selection problems.
Graphs provide an efficient tool for object representation in various
computer vision applications. Once graph-based representations are constructed,
an important question is how to compare graphs. This problem is often
formulated as a graph matching problem where one seeks a mapping between
vertices of two graphs which optimally aligns their structure. In the classical
formulation of graph matching, only one-to-one correspondences between vertices
are considered.
Sparse coding--that is, modelling data vectors as sparse linear combinations
of basis elements--is widely used in machine learning, neuroscience, signal
processing, and statistics. This paper focuses on the large-scale matrix
factorization problem that consists of learning the basis set, adapting it to
specific data. Variations of this problem include dictionary learning in signal
processing, non-negative matrix factorization and sparse principal component
analysis.
Most of the non-asymptotic theoretical work in regression is carried out for
the square loss, where estimators can be obtained through closed-form
expressions. In this paper, we use and extend tools from the convex
optimization literature, namely self-concordant functions, to provide simple
extensions of theoretical results for the square loss to the logistic loss.
We consider the empirical risk minimization problem for linear supervised
learning, with regularization by structured sparsity-inducing norms. These are
defined as sums of Euclidean norms on certain subsets of variables, extending
the usual $\ell_1$-norm and the group $\ell_1$-norm by allowing the subsets to
overlap. This leads to a specific set of allowed nonzero patterns for the
solutions of such problems.
This paper tackles the problem of selecting among several linear estimators
in non-parametric regression; this includes model selection for linear
regression, the choice of a regularization parameter in kernel ridge regression
or spline smoothing, and the choice of a kernel in multiple kernel learning. We
propose a new algorithm which first estimates consistently the variance of the
noise, based upon the concept of minimal penalty which was previously
introduced in the context of model selection.
We present an extension of sparse PCA, or sparse dictionary learning, where
the sparsity patterns of all dictionary elements are structured and constrained
to belong to a prespecified set of shapes. This \emph{structured sparse PCA} is
based on a structured regularization recently introduced by [1]. While
classical sparse priors only deal with \textit{cardinality}, the regularization
we use encodes higher-order information about the data. We propose an efficient
and simple optimization procedure to solve this problem.
We consider the problem of high-dimensional non-linear variable selection for
supervised learning. Our approach is based on performing linear selection among
exponentially many appropriately defined positive definite kernels that
characterize non-linear interactions between the original variables.