Enhancing Text Datasets With Scaling and Targeting Data Augmentation to Improve BERT-Based Machine Learners

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Synthetic data is used to increase a dataset's size for machine learning when acquiring new data is difficult. However, generating it is difficult for text data due to its symbolic nature. Large language models have decreased this difficulty. Using autism spectrum disorder as a use case, we analyze how synthetic data chosen based on descriptive metrics impacts the performance of a downstream classifier. We leverage a fine-tuned, multilabel bidirectional encoder model to label textual descriptions of children's behaviors (N = 10892) with seven diagnostic criteria. We measure precision, recall, and F1 per label to compare the impact of augmentation schemes. We evaluate performance without augmentation (baseline), then compare data source (original versus synthetic), amount of data added (50% or 100% of baseline count), and method of augmentation (adding to one class via Data Targeting or to the entire dataset via Data Scaling). The data points were selected based on scores from our white-box metrics: type-token ratio, cosine similarity, and perplexity. We also conducted a qualitative evaluation of the data using expert feedback. Augmentation resulted in a consistent increase in recall (approximately 8%) but a similarly consistent decrease in precision (approximately 10%). Neither the white-box metrics nor the subsequent standard-deviation-based stability analysis showed a clear relationship to our model's results. Cost analysis showed that data targeting could lower the BioBERT model's cost. Overall, this study shows that different schemes should be favored depending on the intent of use, e.g., screening or diagnosing in medicine.
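As a rough illustration of the per-label evaluation described in the abstract, the sketch below computes precision, recall, and F1 separately for each label of a multilabel classifier. It is not the authors' code: the function name `per_label_scores`, the label names, and the toy predictions are all hypothetical, and the study's actual evaluation pipeline may differ.

```python
from typing import Dict, List, Sequence


def per_label_scores(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
    labels: List[str],
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 separately for each label.

    y_true and y_pred hold binary indicator rows with one column per label.
    """
    scores = {}
    for j, name in enumerate(labels):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in pairs if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in pairs if t[j] == 1 and p[j] == 0)
        # Fall back to 0.0 when a denominator is zero (no predictions or no positives).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[name] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy data with two hypothetical labels (not the study's criteria or results).
labels = ["criterion_a", "criterion_b"]
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = per_label_scores(y_true, y_pred, labels)
```

Reporting the scores per label, rather than as a single average, makes a trade-off such as higher recall with lower precision visible for each diagnostic criterion, which matters when choosing between a screening use and a diagnostic use.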

Original language: English (US)
Article number: 128151
Journal: Expert Systems With Applications
Volume: 286
State: Published - Aug 15 2025
Externally published: Yes

Keywords

  • BERT
  • Cost analysis
  • Data augmentation
  • LLM
  • Large language models
  • Stability analysis
  • Synthetic data
  • Text data
  • Explainable AI

ASJC Scopus subject areas

  • General Engineering
  • Computer Science Applications
  • Artificial Intelligence
