Abstract
Synthetic data is used to increase a dataset's size for machine learning when acquiring new data is difficult. Generating such data has been hard for text because of its symbolic nature, but large language models have reduced this difficulty. Using autism spectrum disorder as a use case, we analyze how synthetic data chosen based on descriptive metrics impacts the performance of a downstream classifier. We use a fine-tuned, multilabel bidirectional encoder model to label textual descriptions of children's behaviors (N = 10,892) with seven diagnostic criteria. We measure precision, recall, and F1 per label to compare the impact of augmentation schemes. We evaluate performance without augmentation (baseline), then compare data source (original versus synthetic), amount of data added (50% or 100% of baseline count), and method of augmentation (adding to one class via Data Targeting or to the entire dataset via Data Scaling). The data points were selected based on scores from our white-box metrics: type-token ratio, cosine similarity, and perplexity. We also conducted a qualitative evaluation of the data using expert feedback. Augmentation resulted in a consistent increase in recall (approximately 8%) but a similarly consistent decrease in precision (approximately 10%). Neither the white-box metrics nor a subsequent standard-deviation-based stability analysis showed a clear relationship to our model's results. Cost analysis showed that Data Targeting could lower the BioBERT model's cost. Overall, this study shows that different schemes should be favored depending on the intent of use, e.g., screening or diagnosing in medicine.
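As an illustration of two quantities named in the abstract, the sketch below computes a type-token ratio and per-label precision, recall, and F1 for multilabel predictions. It is a minimal example and not the authors' implementation: the function names are hypothetical, whitespace tokenization is an assumption (the abstract does not say how tokens were defined), and the toy data uses two labels rather than the seven diagnostic criteria.

```python
def type_token_ratio(text):
    """Unique tokens divided by total tokens (assumes lowercased whitespace tokens)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def per_label_prf(y_true, y_pred, n_labels):
    """Return one (precision, recall, F1) tuple per label.

    y_true and y_pred are lists of 0/1 rows, one row per sample and
    one column per label (multilabel indicator format).
    """
    scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        # Define each score as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


# Toy data: three samples, two labels (hypothetical values).
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 1], [0, 1], [0, 1]]
for label, (p, r, f) in enumerate(per_label_prf(y_true, y_pred, 2)):
    print(f"label {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reporting the scores per label rather than as a single average makes the trade-off described in the abstract visible: an augmentation scheme can raise recall on a label while lowering its precision.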
| Original language | English (US) |
|---|---|
| Article number | 128151 |
| Journal | Expert Systems With Applications |
| Volume | 286 |
| State | Published - Aug 15 2025 |
| Externally published | Yes |
Keywords
- BERT
- Cost analysis
- Data augmentation
- LLM
- Large language models
- Stability analysis
- Synthetic data
- Text data
- explainable AI
ASJC Scopus subject areas
- General Engineering
- Computer Science Applications
- Artificial Intelligence