TY - GEN
T1 - An Experimental Design Approach for Regret Minimization in Logistic Bandits
AU - Mason, Blake
AU - Jun, Kwang Sung
AU - Jain, Lalit
N1 - Publisher Copyright:
© 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2022/6/30
Y1 - 2022/6/30
N2 - In this work we consider the problem of regret minimization for logistic bandits. The main challenge of logistic bandits is reducing the dependence on a potentially large problem-dependent constant κ that can at worst scale exponentially with the norm of the unknown parameter θ∗. Prior works have applied self-concordance of the logistic function to remove this worst-case dependence, providing regret guarantees like O(d log²(κ) √(µT log(|X|))), where d is the dimensionality, T is the time horizon, and µ is the variance of the best arm. This work improves upon this bound in the fixed-arm setting by employing an experimental design procedure that achieves a minimax regret of O(√(dµT log(|X|))). In fact, our analysis yields a tighter instance-dependent (i.e., gap-dependent) regret bound, the first of its kind for logistic bandits. We also propose a new warmup sampling algorithm that can dramatically reduce the lower-order term in the regret in general, and prove that for some instances it reduces the lower-order term's dependence on κ to log²(κ). Finally, we discuss the impact of the bias of the MLE on the logistic bandit problem, providing an example where the d² lower-order regret (cf. d for linear bandits) may not be improved as long as the MLE is used, and showing how bias-corrected estimators may be used to bring it closer to d.
AB - In this work we consider the problem of regret minimization for logistic bandits. The main challenge of logistic bandits is reducing the dependence on a potentially large problem-dependent constant κ that can at worst scale exponentially with the norm of the unknown parameter θ∗. Prior works have applied self-concordance of the logistic function to remove this worst-case dependence, providing regret guarantees like O(d log²(κ) √(µT log(|X|))), where d is the dimensionality, T is the time horizon, and µ is the variance of the best arm. This work improves upon this bound in the fixed-arm setting by employing an experimental design procedure that achieves a minimax regret of O(√(dµT log(|X|))). In fact, our analysis yields a tighter instance-dependent (i.e., gap-dependent) regret bound, the first of its kind for logistic bandits. We also propose a new warmup sampling algorithm that can dramatically reduce the lower-order term in the regret in general, and prove that for some instances it reduces the lower-order term's dependence on κ to log²(κ). Finally, we discuss the impact of the bias of the MLE on the logistic bandit problem, providing an example where the d² lower-order regret (cf. d for linear bandits) may not be improved as long as the MLE is used, and showing how bias-corrected estimators may be used to bring it closer to d.
UR - http://www.scopus.com/inward/record.url?scp=85147655386&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147655386&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85147655386
T3 - Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022
SP - 7736
EP - 7743
BT - AAAI-22 Technical Tracks 7
PB - Association for the Advancement of Artificial Intelligence
T2 - 36th AAAI Conference on Artificial Intelligence, AAAI 2022
Y2 - 22 February 2022 through 1 March 2022
ER -