We address online linear optimization problems when the possible actions of
the decision maker are represented by binary vectors. The regret of the
decision maker is the difference between her realized loss and the best loss
she would have achieved by picking, in hindsight, the best possible action. Our
goal is to understand the magnitude of the best possible (minimax) regret. We
study the problem under three different assumptions for the feedback the
decision maker receives: full information, and the partial information models
of the so-called "semi-bandit" and "bandit" problems.
This paper studies the deviations of the regret in a stochastic multi-armed
bandit problem. When the total number of plays n is known beforehand by the
agent, Audibert et al. (2009) exhibit a policy such that with probability at
least 1-1/n, the regret of the policy is of order log(n). They have also shown
that such a property is not shared by the popular ucb1 policy of Auer et al.
(2002). This work first answers an open question: it extends this negative
result to any anytime policy.
It has been recently shown that, under the margin (or low noise) assumption,
there exist classifiers attaining fast rates of convergence of the excess Bayes
risk, i.e., the rates faster than $n^{-1/2}$. The works on this subject
suggested the following two conjectures: (i) the best achievable fast rate is
of the order $n^{-1}$, and (ii) the plug-in classifiers generally converge
slower than the classifiers based on empirical risk minimization. We show that
both conjectures are not correct.
This habilitation thesis presents several contributions to (1) the
PAC-Bayesian analysis of statistical learning, (2) the three aggregation
problems: given d functions, how to predict as well as (i) the best of these d
functions (model selection type aggregation), (ii) the best convex combination
of these d functions, (iii) the best linear combination of these d functions,
(3) the multi-armed bandit problems.
We consider the problem of robustly predicting as well as the best linear
combination of d given functions in least squares regression, and variants of
this problem including constraints on the parameters of the linear combination.
For the ridge estimator and the ordinary least squares estimator, and their
variants, we provide new risk bounds of order d/n without logarithmic factor
unlike some standard results, where n is the size of the training data. We also
provide a new estimator with better deviations in presence of heavy-tailed
noise.
We consider the problem of predicting as well as the best linear combination
of d given functions in least squares regression under $L^\infty$ constraints
on the linear combination. When the input distribution is known, there already
exists an algorithm having an expected excess risk of order d/n, where n is the
size of the training data.
We consider the empirical risk minimization problem for linear supervised
learning, with regularization by structured sparsity-inducing norms. These are
defined as sums of Euclidean norms on certain subsets of variables, extending
the usual $\ell_1$-norm and the group $\ell_1$-norm by allowing the subsets to
overlap. This leads to a specific set of allowed nonzero patterns for the
solutions of such problems.
We develop minimax optimal risk bounds for the general learning task
consisting in predicting as well as the best function in a reference set
$\mathcal{G}$ up to the smallest possible additive term, called the convergence
rate. When the reference set is finite and when $n$ denotes the size of the
training data, we provide minimax convergence rates of the form
$C(\frac{\log|\mathcal{G}|}{n})^v$ with tight evaluation of the positive
constant $C$ and with exact $0<v\le1$, the latter value depending on the
convexity of the loss function and on the level of noise in the output
distribution.
We develop minimax optimal risk bounds for the general learning task
consisting in predicting as well as the best function in a reference set
$\mathcal{G}$ up to the smallest possible additive term, called the convergence
rate. When the reference set is finite and when $n$ denotes the size of the
training data, we provide minimax convergence rates of the form
$C(\frac{\log|\mathcal{G}|}{n})^v$ with tight evaluation of the positive
constant $C$ and with exact $0<v\le1$, the latter value depending on the
convexity of the loss function and on the level of noise in the output
distribution.