TY - GEN
T1 - Regret Analysis of Stochastic Multi-armed Bandit Problem with Clustered Information Feedback
AU - Zhao, Tianchi
AU - Jiang, Bo
AU - Li, Ming
AU - Tandon, Ravi
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/7
Y1 - 2020/7
N2 - In this paper, we analyze the regret bounds of Multi-armed Bandit (MAB) algorithms in the setting where the payoffs of an arbitrary-size cluster of arms are observable in each round. Compared with the well-studied bandit and full-feedback settings, where the payoffs of the selected arm or of all arms are observable, the clustered feedback setting generalizes both and bridges them. We focus on the two most representative MAB algorithms, Upper Confidence Bound and Thompson Sampling, and adapt them to the clustered feedback setting. We then theoretically derive the regret bound of each for general payoffs (values drawn from continuous domains). We show that the regret bounds of both algorithms under clustered information feedback depend only on the number of clusters. Finally, we run simulations on both synthetic and real-world data to compare the performance of these algorithms under different numbers of observable payoffs per round; the results validate our analysis.
AB - In this paper, we analyze the regret bounds of Multi-armed Bandit (MAB) algorithms in the setting where the payoffs of an arbitrary-size cluster of arms are observable in each round. Compared with the well-studied bandit and full-feedback settings, where the payoffs of the selected arm or of all arms are observable, the clustered feedback setting generalizes both and bridges them. We focus on the two most representative MAB algorithms, Upper Confidence Bound and Thompson Sampling, and adapt them to the clustered feedback setting. We then theoretically derive the regret bound of each for general payoffs (values drawn from continuous domains). We show that the regret bounds of both algorithms under clustered information feedback depend only on the number of clusters. Finally, we run simulations on both synthetic and real-world data to compare the performance of these algorithms under different numbers of observable payoffs per round; the results validate our analysis.
KW - clustered information feedback
KW - multi-armed bandit
KW - problem-dependent bound
KW - regret bound
UR - http://www.scopus.com/inward/record.url?scp=85093827809&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093827809&partnerID=8YFLogxK
U2 - 10.1109/IJCNN48605.2020.9207422
DO - 10.1109/IJCNN48605.2020.9207422
M3 - Conference contribution
AN - SCOPUS:85093827809
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2020 International Joint Conference on Neural Networks, IJCNN 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 International Joint Conference on Neural Networks, IJCNN 2020
Y2 - 19 July 2020 through 24 July 2020
ER -