Recognition as translating images into text

Kobus Barnard, Pinar Duygulu, David Forsyth

Research output: Contribution to journal › Conference article › peer-review

4 Scopus citations


We present an overview of a new paradigm for tackling long-standing computer vision problems. Specifically, our approach is to build statistical models which translate from visual representations (images) to semantic ones (associated text). As providing optimal text for training is difficult at best, we propose working with whatever associated text is available in large quantities. Examples include large image collections with keywords, museum image collections with descriptive text, news photos, and images on the web. In this paper we discuss how the translation approach can give a handle on difficult questions such as: What counts as an object? Which objects are easy to recognize and which are hard? Which objects are indistinguishable using our features? How can low-level vision processes, such as feature-based segmentation, be integrated with high-level processes such as grouping? We also summarize some of the models proposed for translating from visual information to text, and some of the methods used to evaluate their performance.

Original language: English (US)
Pages (from-to): 168-178
Number of pages: 11
Journal: Proceedings of SPIE - The International Society for Optical Engineering
State: Published - 2003
Event: Internet Imaging IV - Santa Clara, CA, United States
Duration: Jan 21 2003 to Jan 22 2003


Keywords

  • Aspect model
  • Hierarchical clustering
  • Learning image semantics
  • Machine translation
  • Object recognition

ASJC Scopus subject areas

  • Electronic, Optical and Magnetic Materials
  • Condensed Matter Physics
  • Computer Science Applications
  • Applied Mathematics
  • Electrical and Electronic Engineering

