Metadata Enhancement Using Large Language Models

Hyunju Song, Steven Bethard, Andrea K. Thomer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In the natural sciences, a common form of scholarly document is a physical sample record, which provides categorical and textual metadata for specimens collected and analyzed for scientific research. Physical sample archives like museums and repositories publish these records in data repositories to support reproducible science and enable the discovery of physical samples. However, the success of resource discovery in such interfaces depends on the completeness of the sample records. We investigate approaches for automatically completing the scientific metadata fields of sample records. We apply large language models in zero-shot and few-shot settings and incorporate the hierarchical structure of the taxonomy. We show that a combination of record summarization, bottom-up taxonomy traversal, and few-shot prompting yields an F1 score as high as 0.928 on metadata completion in the Earth science domain.

Original language: English (US)
Title of host publication: SDP 2024 - 4th Workshop on Scholarly Document Processing, Proceedings of the Workshop
Editors: Tirthankar Ghosal, Amanpreet Singh, Anita de Waard, Philipp Mayr, Aakanksha Naik, Orion Weller, Yoonjoo Lee, Shannon Shen, Yanxia Qin
Publisher: Association for Computational Linguistics (ACL)
Pages: 145-154
Number of pages: 10
ISBN (Electronic): 9798891761513
State: Published - 2024
Event: 4th Workshop on Scholarly Document Processing, SDP 2024 at ACL 2024 - Bangkok, Thailand
Duration: Aug 16 2024 → …

Publication series

Name: SDP 2024 - 4th Workshop on Scholarly Document Processing, Proceedings of the Workshop

Conference

Conference: 4th Workshop on Scholarly Document Processing, SDP 2024 at ACL 2024
Country/Territory: Thailand
City: Bangkok
Period: 8/16/24 → …

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Linguistics and Language
