Improved Regret Bounds of Bilinear Bandits using Action Space Analysis

Kyoungseok Jang, Kwang Sung Jun, Se Young Yun, Wanmo Kang

Research output: Chapter in Book/Report/Conference proceedingConference contribution

2 Scopus citations

Abstract

We consider the bilinear bandit problem where the learner chooses a pair of arms, each from two different action spaces of dimension d1 and d2, respectively. The learner then receives a reward whose expectation is a bilinear function of the two chosen arms with an unknown matrix parameter Θ ∈ Rd1×d2 with rank r. Despite abundant applications such as drug discovery, the optimal regret rate is unknown for this problem, though it was conjectured to be Õ(√d1d2(d1 + d2)rT) by Jun et al. (2019) where Õ ignores polylogarithmic factors in T. In this paper, we make progress towards closing the gap between the upper and lower bound on the optimal regret. First, we reject the conjecture above by proposing algorithms that achieve the regret Õ(√d1d2(d1 + d2)T) using the fact that the action space dimension O(d1+d2) is significantly lower than the matrix parameter dimension O(d1d2). Second, we additionally devise an algorithm with better empirical performance than previous algorithms.

Original languageEnglish (US)
Title of host publicationProceedings of the 38th International Conference on Machine Learning, ICML 2021
PublisherML Research Press
Pages4744-4754
Number of pages11
ISBN (Electronic)9781713845065
StatePublished - 2021
Event38th International Conference on Machine Learning, ICML 2021 - Virtual, Online
Duration: Jul 18 2021Jul 24 2021

Publication series

NameProceedings of Machine Learning Research
Volume139
ISSN (Electronic)2640-3498

Conference

Conference38th International Conference on Machine Learning, ICML 2021
CityVirtual, Online
Period7/18/217/24/21

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

Fingerprint

Dive into the research topics of 'Improved Regret Bounds of Bilinear Bandits using Action Space Analysis'. Together they form a unique fingerprint.

Cite this