TY - GEN
T1 - Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards
AU - Qin, Hao
AU - Jun, Kwang Sung
AU - Zhang, Chicheng
N1 - Publisher Copyright:
© 2023 Neural information processing systems foundation. All rights reserved.
PY - 2023
Y1 - 2023
N2 - We study K-armed bandit problems where the reward distributions of the arms are all supported on the [0, 1] interval. Maillard sampling [30], an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting [11] while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we analyze the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling and a special case of Minimum Empirical Divergence (MED) [19], for achieving a KL-style finite-time gap-dependent regret bound. We show that KL-MS enjoys asymptotic optimality when the rewards are Bernoulli and has an adaptive worst-case regret bound of the form O(√(µ∗(1 − µ∗)KT ln K) + K ln T), where µ∗ is the expected reward of the optimal arm and T is the time horizon length; this is the first time such adaptivity is reported in the literature for an algorithm with asymptotic optimality guarantees.
AB - We study K-armed bandit problems where the reward distributions of the arms are all supported on the [0, 1] interval. Maillard sampling [30], an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting [11] while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we analyze the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling and a special case of Minimum Empirical Divergence (MED) [19], for achieving a KL-style finite-time gap-dependent regret bound. We show that KL-MS enjoys asymptotic optimality when the rewards are Bernoulli and has an adaptive worst-case regret bound of the form O(√(µ∗(1 − µ∗)KT ln K) + K ln T), where µ∗ is the expected reward of the optimal arm and T is the time horizon length; this is the first time such adaptivity is reported in the literature for an algorithm with asymptotic optimality guarantees.
UR - https://www.scopus.com/pages/publications/85182192052
UR - https://www.scopus.com/pages/publications/85182192052#tab=citedBy
M3 - Conference contribution
AN - SCOPUS:85182192052
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
A2 - Oh, A.
A2 - Naumann, T.
A2 - Globerson, A.
A2 - Saenko, K.
A2 - Hardt, M.
A2 - Levine, S.
PB - Neural information processing systems foundation
T2 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Y2 - 10 December 2023 through 16 December 2023
ER -