TY - GEN

T1 - Improved Regret Bounds of Bilinear Bandits using Action Space Analysis

AU - Jang, Kyoungseok

AU - Jun, Kwang Sung

AU - Yun, Se Young

AU - Kang, Wanmo

N1 - Publisher Copyright:
Copyright © 2021 by the author(s)

PY - 2021

Y1 - 2021

N2 - We consider the bilinear bandit problem where the learner chooses a pair of arms, each from two different action spaces of dimensions d1 and d2, respectively. The learner then receives a reward whose expectation is a bilinear function of the two chosen arms with an unknown matrix parameter Θ* ∈ R^(d1×d2) of rank r. Despite abundant applications such as drug discovery, the optimal regret rate is unknown for this problem, though it was conjectured to be Õ(√(d1d2(d1 + d2)rT)) by Jun et al. (2019), where Õ ignores polylogarithmic factors in T. In this paper, we make progress towards closing the gap between the upper and lower bounds on the optimal regret. First, we reject the conjecture above by proposing algorithms that achieve the regret Õ(√(d1d2(d1 + d2)T)), using the fact that the action space dimension O(d1 + d2) is significantly lower than the matrix parameter dimension O(d1d2). Second, we additionally devise an algorithm with better empirical performance than previous algorithms.

AB - We consider the bilinear bandit problem where the learner chooses a pair of arms, each from two different action spaces of dimensions d1 and d2, respectively. The learner then receives a reward whose expectation is a bilinear function of the two chosen arms with an unknown matrix parameter Θ* ∈ R^(d1×d2) of rank r. Despite abundant applications such as drug discovery, the optimal regret rate is unknown for this problem, though it was conjectured to be Õ(√(d1d2(d1 + d2)rT)) by Jun et al. (2019), where Õ ignores polylogarithmic factors in T. In this paper, we make progress towards closing the gap between the upper and lower bounds on the optimal regret. First, we reject the conjecture above by proposing algorithms that achieve the regret Õ(√(d1d2(d1 + d2)T)), using the fact that the action space dimension O(d1 + d2) is significantly lower than the matrix parameter dimension O(d1d2). Second, we additionally devise an algorithm with better empirical performance than previous algorithms.

UR - http://www.scopus.com/inward/record.url?scp=85121793373&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85121793373&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85121793373

T3 - Proceedings of Machine Learning Research

SP - 4744

EP - 4754

BT - Proceedings of the 38th International Conference on Machine Learning, ICML 2021

PB - ML Research Press

T2 - 38th International Conference on Machine Learning, ICML 2021

Y2 - 18 July 2021 through 24 July 2021

ER -