Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing

Research output: Contribution to journal › Conference article › peer-review

Abstract

We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is twofold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that this method achieves a significantly improved, variance-adaptive guarantee over prior work. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce low variance and is therefore preferred in small-data regimes. Our analysis shows that it matches the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.
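To make the selection step concrete, below is a minimal Python sketch of a generic betting-based lower confidence bound applied to an inverse-propensity-weighted reward sequence, in the spirit described above. It uses Ville's inequality on a capital process with a truncated approximate-Kelly bet; the function name `betting_lcb`, the grid size, the variance floor, and the betting scheme are all illustrative assumptions, not the paper's exact method or its variance-adaptive analysis.

```python
import numpy as np

def betting_lcb(z, alpha=0.05, grid=100):
    """Lower confidence bound on E[z] for z_i in [0, 1] via betting.

    For each candidate mean m, run the capital process
        K_t(m) = prod_{i<=t} (1 + lam_i * (z_i - m)),
    where lam_i in [0, 0.5/m] is a predictable, truncated
    approximate-Kelly bet. Under the null "E[z] <= m", K_t(m) is a
    nonnegative supermartingale, so K_t(m) >= 1/alpha rejects the null
    at level alpha by Ville's inequality. The LCB is (approximately)
    the largest rejected m. This is a generic betting construction,
    not the paper's exact variance-adaptive bound.
    """
    z = np.asarray(z, dtype=float)
    lcb = 0.0
    for m in np.linspace(0.0, 1.0, grid):
        s1, s2, n = 0.5, 0.25, 1.0      # pseudo-counts stabilize early bets
        wealth, rejected = 1.0, False
        for zi in z:
            mu = s1 / n
            var = max(s2 / n - mu * mu, 1e-2)
            # truncation keeps 1 + lam*(zi - m) > 0, so wealth stays positive
            lam = min(max((mu - m) / var, 0.0), 0.5 / max(m, 1e-3))
            wealth *= 1.0 + lam * (zi - m)
            if wealth >= 1.0 / alpha:   # Ville: reject "E[z] <= m"
                rejected = True
                break
            s1, s2, n = s1 + zi, s2 + zi * zi, n + 1.0
        if not rejected:
            break                        # treat the rejected set as [0, lcb]
        lcb = m
    return lcb

# Toy logged-bandit demo: uniform behavior policy over K actions.
rng = np.random.default_rng(0)
n, K = 2000, 5
x = rng.uniform(size=n)                             # contexts
a = rng.integers(K, size=n)                         # logged actions
best = (x > 0.5).astype(int)                        # optimal action per context
r = rng.binomial(1, np.where(a == best, 0.8, 0.2))  # rewards in {0, 1}

candidates = {
    "good": lambda x: (x > 0.5).astype(int),        # true value 0.8
    "bad": lambda x: np.full(len(x), 3),            # true value 0.2
}
for name, pi in candidates.items():
    w = K * (a == pi(x))                            # inverse propensity weights
    z = np.clip(w * r / K, 0.0, 1.0)                # rescale IPW terms to [0, 1]
    print(name, "policy value LCB:", K * betting_lcb(z))
```

Off-policy selection then amounts to computing such a lower bound for each candidate policy and choosing the policy with the largest bound; the paper's contribution is a sharper, variance-adaptive version of this bound.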

Original language: English (US)
Journal: Proceedings of Machine Learning Research
Volume: 291
State: Published - 2025
Event: 38th Annual Conference on Learning Theory, COLT 2025 - Lyon, France
Duration: Jun 30, 2025 - Jul 4, 2025

Keywords

  • confidence bounds
  • martingale
  • offline contextual bandits
  • second-order bounds

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
