Elevating Textual Question Answering with On-Demand Visual Augmentation

Research output: Contribution to journal › Article › peer-review

Abstract

Textual Question Answering (TQA) remains a formidable challenge, despite over a decade of research. The integration of transformer networks and external knowledge via pre-trained models has marked a significant advancement in TQA. Yet, a crucial element often overlooked is the incorporation of external visual understanding. In this study, we introduce an innovative TQA approach that equips machines with the capability for on-demand visual grounding, thereby enriching their comprehension of questions and enhancing the relevance of generated answers. Our methodology utilizes web image search to tap into a vast pool of global knowledge and employs a novel technique for determining the most appropriate answer through on-demand visual grounding. We present a variety of multimedia model configurations, showcasing that our proposed method not only surpasses existing systems without necessitating pre-training but also achieves performance comparable to fine-tuned models 30 times its size as well as closed-source LLMs such as GPT-4o, a testament to its efficiency. Furthermore, an interpretability analysis reveals the integral role of visual grounding in the model's decision-making process. This research offers a fresh outlook on augmenting TQA performance by harnessing the potential of visual grounding, with broad implications for natural language processing and artificial intelligence.

Original language: English (US)
Article number: 287
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 21
Issue number: 10
State: Published - Oct 15 2025
Externally published: Yes

Keywords

  • Computer Vision
  • Multimedia Information Retrieval
  • Multimedia and multimodal retrieval
  • Multimodal Deep Learning
  • Natural Language Processing
  • On-Demand Data Augmentation
  • Question Answering
  • Visual Grounding
  • Visual Question Answering

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Networks and Communications
