Abstract
Textual Question Answering (TQA) remains a formidable challenge despite over a decade of research. The integration of transformer networks and external knowledge via pre-trained models has marked a significant advance in TQA. Yet a crucial element is often overlooked: the incorporation of external visual understanding. In this study, we introduce a TQA approach that equips machines with on-demand visual grounding, enriching their comprehension of questions and improving the relevance of generated answers. Our methodology uses web image search to tap into a vast pool of global knowledge and employs a novel technique for determining the most appropriate answer through on-demand visual grounding. We present a variety of multimedia model configurations and show that the proposed method not only surpasses existing systems without necessitating pre-training but also achieves performance comparable to fine-tuned models 30 times its size, as well as to closed-source LLMs such as GPT-4o, which attests to its efficiency. Furthermore, an interpretability analysis reveals the integral role of visual grounding in the model's decision-making process. This research offers a fresh outlook on improving TQA performance through visual grounding, with broad implications for natural language processing and artificial intelligence.
| Original language | English (US) |
|---|---|
| Article number | 287 |
| Journal | ACM Transactions on Multimedia Computing, Communications and Applications |
| Volume | 21 |
| Issue number | 10 |
| State | Published - Oct 15 2025 |
| Externally published | Yes |
Keywords
- Computer Vision
- Multimedia Information Retrieval
- Multimedia and multimodal retrieval
- Multimodal Deep Learning
- Natural Language Processing
- On-Demand Data Augmentation
- Question Answering
- Visual Grounding
- Visual Question Answering
ASJC Scopus subject areas
- Hardware and Architecture
- Computer Networks and Communications