Decentralized Multi-Armed Bandit with Multiple Distributed Players.

link: http://arxiv.org/abs/0910.2065
Abstract

We consider multi-armed bandit with distributed players, where each player
independently samples one of N stochastic processes with unknown parameters and
accrues reward in each slot without information exchange. Users choosing the
same arm collide, and none or only one receives reward depending on the
collision model. This problem can be formulated as a decentralized multi-armed
bandit problem. We measure the performance of a decentralized policy by the
system regret, defined as the total reward loss with respect to the optimal
performance under the perfect scenario where all arm parameters are known to
all users and collisions among users are eliminated through perfect scheduling.
We show that the minimum system regret grows with time at the same logarithmic
order as in the centralized counterpart, where users exchange observations and
make decisions jointly. A decentralized policy is constructed to achieve this
optimal order. Furthermore, we show that the proposed policy belongs to a
general class of decentralized polices, for which a uniform performance
benchmark is established.