Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

Research output: Contribution to journal › Conference article › peer-review

Abstract

We study K-armed bandit problems where the reward distributions of the arms are all supported on the [0, 1] interval. Maillard sampling [30], an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting [11] while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we analyze the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling and a special case of Minimum Empirical Divergence (MED) [19], which achieves a KL-style finite-time gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has an adaptive worst-case regret bound of the form O(√(µ*(1 − µ*)KT ln K) + K ln T), where µ* is the expected reward of the optimal arm and T is the time horizon length; this is the first time such adaptivity is reported in the literature for an algorithm with asymptotic optimality guarantees.
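
For readers who want the mechanics behind the closed-form action probabilities mentioned above, here is a minimal Python sketch of the KL-MS sampling rule, assuming the rule p_a ∝ exp(−N_a · kl(µ̂_a, µ̂_max)) with kl the binary (Bernoulli) KL divergence, N_a the pull count, and µ̂_a the empirical mean of arm a; the helper names and the Bernoulli simulation loop are illustrative assumptions, not code from the paper.

import math
import random

def bernoulli_kl(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q) between Bernoulli means in [0, 1]."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ms_probabilities(means, counts):
    """Closed-form KL-MS action probabilities:
    p_a proportional to exp(-N_a * kl(mu_hat_a, mu_hat_max))."""
    mu_max = max(means)
    weights = [math.exp(-n * bernoulli_kl(m, mu_max))
               for m, n in zip(means, counts)]
    total = sum(weights)
    return [w / total for w in weights]

def simulate(true_means, horizon, seed=0):
    """Run KL-MS on Bernoulli arms: pull each arm once, then sample
    arms from the closed-form distribution at every round."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [1] * k
    sums = [1.0 if rng.random() < mu else 0.0 for mu in true_means]
    for _ in range(horizon - k):
        means = [s / n for s, n in zip(sums, counts)]
        probs = kl_ms_probabilities(means, counts)
        a = rng.choices(range(k), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
    return counts

Because the action probabilities are available in closed form (unlike Thompson sampling, where they are only implicit in the posterior), the probs vector above can be logged directly, which is what makes the algorithm convenient for offline policy evaluation.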

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
Volume: 36
State: Published - 2023
Event: 37th Conference on Neural Information Processing Systems, NeurIPS 2023 - New Orleans, United States
Duration: Dec 10, 2023 – Dec 16, 2023

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
