Multi-armed bandit problems are the most basic examples of sequential
decision problems with an exploration-exploitation trade-off. This is the
balance between staying with the option that gave highest payoffs in the past
and exploring new options that might give higher payoffs in the future.
Although the study of bandit problems dates back to the Thirties,
exploration-exploitation trade-offs arise in several modern applications, such
as ad placement, website optimization, and packet routing. Mathematically, a
multi-armed bandit is defined by the payoff process associated with each
option.
We provide the first algorithm for online bandit linear optimization whose
regret after T rounds is of order sqrt{Td ln N} on any finite class X of N
actions in d dimensions, and of order d*sqrt{T} (up to log factors) when X is
infinite. These bounds are not improvable in general. The basic idea utilizes
tools from convex geometry to construct what is essentially an optimal
exploration basis. We also present an application to a model of linear bandits
with expert advice.
Most online algorithms used in machine learning today are based on variants
of mirror descent or follow-the-leader. In this paper, we present an online
algorithm based on a completely different approach, which combines "random
playout" and randomized rounding of loss subgradients. As an application of our
approach, we provide the first computationally efficient online algorithm for
collaborative filtering with norm-constrained matrices. As a second
application, we solve an open question linking batch learning and transductive
online learning.
We develop a coherent framework for integrative simultaneous analysis of the
exploration-exploitation and model order selection trade-offs. We improve over
our preceding results on the same subject (Seldin et al., 2011) by combining
PAC-Bayesian analysis with Bernstein-type inequality for martingales. Such a
combination is also of independent interest for studies of multiple
simultaneously evolving martingales.
We study online learning when individual instances are corrupted by random
noise. We assume the noise distribution is unknown, and may change over time
with no restriction other than having zero mean and bounded variance. Our
technique relies on a family of unbiased estimators for non-linear functions,
which may be of independent interest. We show that a variant of online gradient
descent can learn functions in any dot-product (e.g., polynomial) or Gaussian
kernel space with any analytic convex loss function.
We describe and analyze efficient algorithms for learning a linear predictor
from examples when the learner can only view a few attributes of each training
example. This is the case, for instance, in medical research, where each
patient participating in the experiment is only willing to go through a small
number of tests. Our analysis bounds the number of additional examples
sufficient to compensate for the lack of full information on each training
example.