SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning

  • Zhenglun Kong
  • Peiyan Dong
  • Xiaolong Ma
  • Xin Meng
  • Wei Niu
  • Mengshu Sun
  • Xuan Shen
  • Geng Yuan
  • Bin Ren
  • Hao Tang
  • Minghai Qin
  • Yanzhi Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while its high computation and memory cost makes deployment in industrial production difficult. Considering computational complexity, the internal data pattern of ViTs, and edge device deployment, we propose a latency-aware soft token pruning framework, SPViT, which can be set up on vanilla Transformers of both flat and hierarchical structures, such as DeiTs and Swin-Transformers (Swin). More concretely, we design a dynamic attention-based multi-head token selector, which is a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique, which integrates the less informative tokens chosen by the selector module into a package token rather than discarding them completely. SPViT is bound to the trade-off between accuracy and latency requirements of specific edge devices through our proposed latency-aware training strategy. Experimental results show that SPViT significantly reduces the computation cost of ViTs with comparable performance on image classification. Moreover, SPViT can guarantee that the identified model meets the latency specifications of mobile devices and FPGAs, and even achieves real-time execution of DeiT-T on mobile devices. For example, SPViT reduces the latency of DeiT-T to 26 ms (26%−41% superior to existing works) on the mobile device with 0.25%−4% higher top-1 accuracy on ImageNet. Our code is released at https://github.com/PeiyanFlying/SPViT.
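
The soft pruning idea in the abstract (merging less informative tokens into a single package token instead of dropping them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name soft_prune, the fixed keep count, and the score-weighted averaging are assumptions made for illustration, and the scores are passed in directly, whereas the paper obtains them from a learned attention-based multi-head selector.

```python
from typing import List, Sequence


def soft_prune(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> List[List[float]]:
    """Keep the `keep` highest-scoring tokens and merge the rest into one
    package token (hypothetical helper, illustration only)."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and the number of tokens")

    # Rank token indices from most to least informative.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # restore original token order
    pruned_idx = order[keep:]

    kept = [list(tokens[i]) for i in kept_idx]
    if not pruned_idx:
        return kept  # nothing to merge

    # Assumption: weight each pruned token by its normalized score;
    # fall back to a plain mean if all pruned scores are zero.
    total = sum(scores[i] for i in pruned_idx)
    if total > 0:
        weights = [scores[i] / total for i in pruned_idx]
    else:
        weights = [1.0 / len(pruned_idx)] * len(pruned_idx)

    dim = len(tokens[0])
    package = [
        sum(w * tokens[i][d] for w, i in zip(weights, pruned_idx))
        for d in range(dim)
    ]
    # The package token is appended after the kept tokens.
    return kept + [package]


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    scs = [0.9, 0.25, 0.75, 0.8]
    print(soft_prune(toks, scs, keep=2))
```

With four input tokens and keep=2, the output has three tokens: the two kept ones plus one package token, so the sequence shrinks while some information from the pruned tokens is retained.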

Original language: English (US)
Title of host publication: Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 620-640
Number of pages: 21
ISBN (Print): 9783031200823
DOIs
State: Published - 2022
Externally published: Yes
Event: 17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: Oct 23 2022 to Oct 27 2022

Publication series

Name: Lecture Notes in Computer Science
Volume: 13671 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th European Conference on Computer Vision, ECCV 2022
Country/Territory: Israel
City: Tel Aviv
Period: 10/23/22 to 10/27/22

Keywords

  • FPGA
  • Hardware acceleration
  • Mobile devices
  • Model compression
  • Vision transformer

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
