Abstract
We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is twofold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that this method achieves a significantly improved, variance-adaptive guarantee over prior work. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce low variance, which is preferred in small-data regimes. Our analysis shows that it matches the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.
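The abstract does not spell out the method's details, so the following is only a rough, hypothetical illustration of the general idea it alludes to: a betting-style lower confidence bound (a test-martingale construction in the spirit of Waudby-Smith and Ramdas) applied to normalized inverse-propensity-weighted rewards, which can then be used to select among candidate policies. The function name `betting_lcb`, the fixed bet size, and the synthetic logged data are all assumptions for illustration, not the paper's actual (variance-adaptive) algorithm.

```python
import numpy as np

def betting_lcb(z, alpha=0.05, lam=0.5, grid=400):
    """Lower confidence bound on E[z] for observations z_t in [0, 1].

    Hypothetical sketch: for each candidate mean m, form the wealth process
    wealth_t(m) = prod_s (1 + lam * (z_s - m)), a nonnegative martingale
    when m is the true mean. By Ville's inequality, wealth exceeds 1/alpha
    with probability at most alpha, so the smallest un-rejected m is a
    valid (1 - alpha) lower confidence bound. A fixed bet lam is used for
    simplicity; betting-based bounds choose bets predictably from the data,
    which is what makes them adaptive and tighter.
    """
    z = np.asarray(z, dtype=float)
    for m in np.linspace(0.0, 1.0, grid):
        # lam * (z - m) >= -0.5 here, so each wealth factor stays positive.
        log_wealth = np.cumsum(np.log1p(lam * (z - m)))
        if log_wealth.max() < np.log(1.0 / alpha):
            return m  # first candidate mean the bettor fails to reject
    return 1.0

# Hypothetical logged bandit data: rewards r in {0, 1}, behavior
# propensities p_b, and target-policy probabilities p_t of the logged
# actions. w_max caps the importance weights (clipping only lowers the
# estimate, so the lower bound remains valid).
rng = np.random.default_rng(0)
n, w_max = 5000, 10.0
p_b = rng.uniform(0.1, 1.0, n)
p_t = np.clip(p_b * rng.uniform(0.5, 2.0, n), 0.0, 1.0)
r = rng.binomial(1, 0.4, n)

z = np.clip(p_t / p_b, 0.0, w_max) * r / w_max  # normalized IPS terms in [0, 1]
lcb = betting_lcb(z, alpha=0.05) * w_max        # undo the normalization
print(f"IPS estimate: {z.mean() * w_max:.3f},  95% betting LCB: {lcb:.3f}")
```

For off-policy selection, one would compute such a lower bound for each candidate policy on the same logged data and pick the policy with the largest bound, a pessimism-style rule; the paper's contribution is a sharper, variance-adaptive bound of this flavor.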
| Original language | English (US) |
|---|---|
| Journal | Proceedings of Machine Learning Research |
| Volume | 291 |
| State | Published - 2025 |
| Event | 38th Annual Conference on Learning Theory, COLT 2025 - Lyon, France |
| Duration | Jun 30 2025 → Jul 4 2025 |
Keywords
- confidence bounds
- martingale
- offline contextual bandits
- second-order bounds
ASJC Scopus subject areas
- Software
- Control and Systems Engineering
- Statistics and Probability
- Artificial Intelligence