Abstract
We present an overview of a new paradigm for tackling long standing computer vision problems. Specifically our approach is to build statistical models which translate from a visual representations (images) to semantic ones (associated text). As providing optimal text for training is difficult at best, we propose working with whatever associated text is available in large quantities. Examples include large image collections with keywords, museum image collections with descriptive text, news photos, and images on the web. In this paper we discuss how the translation approach can give a handle on difficult questions such as: What counts as an object? Which objects are easy to recognize and which are hard? Which objects are indistinguishable using our features? How to integrate low level vision processes such as feature based segmentation, with high level processes such as grouping. We also summarize some of the models proposed for translating from visual information to text, and some of the methods used to evaluate their performance.
Original language | English (US) |
---|---|
Pages (from-to) | 168-178 |
Number of pages | 11 |
Journal | Proceedings of SPIE - The International Society for Optical Engineering |
Volume | 5018 |
DOIs | |
State | Published - 2003 |
Event | Internet Imaging IV - Santa Clara, CA, United States Duration: Jan 21 2003 → Jan 22 2003 |
Keywords
- Aspect model
- Hierarchical clustering
- Learning image semantics
- Machine translation
- Object recognition
ASJC Scopus subject areas
- Electronic, Optical and Magnetic Materials
- Condensed Matter Physics
- Computer Science Applications
- Applied Mathematics
- Electrical and Electronic Engineering