
VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness

  • Rongyu Zhang
  • Zefan Cai
  • Huanrui Yang
  • Zidong Liu
  • Denis Gudovskiy
  • Tomoyuki Okuno
  • Yohei Nakata
  • Kurt Keutzer
  • Baobao Chang
  • Yuan Du
  • Li Du
  • Shanghang Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Finetuning a pretrained vision model (PVM) is a common technique for learning downstream vision tasks. However, the conventional finetuning process with randomly sampled data points results in diminished training efficiency. To address this drawback, we propose a novel approach, Vision-language Collaborative Active Finetuning (VeCAF). VeCAF optimizes a parametric data selection model by incorporating the training objective of the model being tuned. Effectively, this guides the PVM towards the performance goal with improved data and computational efficiency. With the ever-growing feasibility of acquiring labels and natural language annotations of image data through web-scale crawling, we exploit the inherent semantic richness of the text embedding space and utilize text embeddings of image annotations to augment PVM image features for better data selection and finetuning. Furthermore, the flexibility of text-domain augmentation gives VeCAF the unique ability to handle out-of-distribution scenarios without external augmented data. Extensive experiments show the leading performance and high efficiency of VeCAF, which is superior to baselines in both in-distribution and out-of-distribution image classification tasks. On ImageNet, VeCAF needs up to 3.3× fewer training batches to reach the target performance compared to full finetuning, and achieves a 2.8% accuracy improvement over state-of-the-art active finetuning methods with the same number of batches. Our code is available at https://github.com/RoyZry98/VeCAF-Pytorch.
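The abstract describes two ingredients: augmenting PVM image features with text embeddings of the image annotations, and selecting finetuning samples with awareness of the current training objective. The sketch below illustrates that idea only in broad strokes; it is not the authors' implementation. The helper names (fuse, select_batch), the blending weight alpha, and the use of per-sample cross-entropy loss as the objective-aware score are all simplifying assumptions made for illustration.

import torch
import torch.nn.functional as F

def fuse(image_feats: torch.Tensor, text_embs: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend PVM image features with text embeddings of the image annotations.

    A simple convex combination of L2-normalized features; the paper's actual
    augmentation scheme may differ.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    return F.normalize(alpha * image_feats + (1 - alpha) * text_embs, dim=-1)

@torch.no_grad()
def select_batch(classifier: torch.nn.Module,
                 image_feats: torch.Tensor,
                 text_embs: torch.Tensor,
                 labels: torch.Tensor,
                 budget: int) -> torch.Tensor:
    """Pick `budget` samples whose fused features currently incur the highest
    classification loss under the model being finetuned (a stand-in for
    training-objective-aware selection)."""
    fused = fuse(image_feats, text_embs)
    logits = classifier(fused)
    per_sample_loss = F.cross_entropy(logits, labels, reduction="none")
    return torch.topk(per_sample_loss, k=budget).indices

# Toy usage: 1000 candidate samples with 512-d features, pick 64 for the next step.
if __name__ == "__main__":
    torch.manual_seed(0)
    classifier = torch.nn.Linear(512, 10)     # stand-in for the PVM's classification head
    image_feats = torch.randn(1000, 512)      # stand-in for PVM image features
    text_embs = torch.randn(1000, 512)        # stand-in for annotation text embeddings
    labels = torch.randint(0, 10, (1000,))
    idx = select_batch(classifier, image_feats, text_embs, labels, budget=64)
    print(idx.shape)  # torch.Size([64])

In this toy version the selection score is simply the current loss on each candidate; the paper instead learns a parametric data selection model guided by the training objective, so the above should be read as a conceptual illustration rather than the method itself.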

Original language: English (US)
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 5451-5459
Number of pages: 9
ISBN (Electronic): 9798400706868
DOIs
State: Published - Oct 28 2024
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: Oct 28 2024 - Nov 1 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 10/28/24 - 11/1/24

Keywords

  • active learning
  • fine-tuning
  • vision-language models

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction
  • Software
