Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It's Complicated

Katherine Metcalf, Miguel Sarabia, Masha Fedzechkina, Barry John Theobald

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Preference-based Reinforcement Learning (PbRL) enables non-experts to train Reinforcement Learning models using preference feedback. However, the effort required to collect preference labels from real humans means that PbRL research primarily relies on synthetic labellers. We validate the most common synthetic labelling strategy by comparing against labels collected from a crowd of humans on three Deep Mind Control (DMC) suite tasks: stand, walk, and run. We find that: (1) the synthetic labels are a good proxy for real humans under some circumstances, (2) strong preference label agreement between human and synthetic labels is not necessary for similar policy performance, (3) policy performance is higher at the start of training from human feedback and is higher at the end of training from synthetic feedback, and (4) training on only examples with high levels of inter-annotator agreement does not meaningfully improve policy performance. Our results justify the use of synthetic labellers to develop and ablate PbRL methods, and provide insight into how human labelling changes over the course of policy training.

Original languageEnglish (US)
Title of host publicationTechnical Tracks 14
EditorsMichael Wooldridge, Jennifer Dy, Sriraam Natarajan
PublisherAssociation for the Advancement of Artificial Intelligence
Pages10128-10136
Number of pages9
Edition9
ISBN (Electronic)1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 1577358872, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879, 9781577358879
DOIs
StatePublished - Mar 25 2024
Externally publishedYes
Event38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada
Duration: Feb 20 2024Feb 27 2024

Publication series

NameProceedings of the AAAI Conference on Artificial Intelligence
Number9
Volume38
ISSN (Print)2159-5399
ISSN (Electronic)2374-3468

Conference

Conference38th AAAI Conference on Artificial Intelligence, AAAI 2024
Country/TerritoryCanada
CityVancouver
Period2/20/242/27/24

ASJC Scopus subject areas

  • Artificial Intelligence

Fingerprint

Dive into the research topics of 'Can You Rely on Synthetic Labellers in Preference-Based Reinforcement Learning? It's Complicated'. Together they form a unique fingerprint.

Cite this