Call for Papers: Special Issue on Transformer Models in Vision
TPAMI seeks submissions for this upcoming special issue.

Transformer models have recently demonstrated exemplary performance on a broad range of language tasks such as text classification, machine translation, and question answering. These breakthroughs in the natural language processing (NLP) domain have sparked great interest in the computer vision community to investigate these models for vision and multi-modal learning tasks. However, visual data follows a typical structure (such as spatial and temporal coherence), thus demanding novel network designs and training schemes. As a result, transformer models and their variants have been successfully used for image recognition, object detection, segmentation, image super-resolution, video understanding, image generation, text-image synthesis and visual question answering.

Among their salient benefits, transformers can model long-range dependencies between input sequence elements and support parallel processing of sequences, unlike recurrent networks such as long short-term memory (LSTM). Unlike convolutional neural networks, transformers require minimal inductive biases in their design and are naturally suited as set-functions. Furthermore, the relatively straightforward design of transformers allows processing multiple modalities (such as images, videos, text, and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets.

This special issue seeks original contributions towards advancing the theory, architecture, and algorithmic design of transformer models in computer vision, as well as novel applications and use cases. We envision original and well-motivated adaptations of transformer models for vision tasks and efforts towards improving their accuracy, robustness, and efficiency. The special issue will provide a timely collection of recent advances to benefit researchers and practitioners working in the broad research fields of computer vision, pattern analysis, and machine intelligence. Topics of interest include (but are not limited to):

  • Theoretical insights into transformer-based models
  • Efficient transformer architectures, including novel mechanisms for self-attention
  • Novel transformer models for spatial (image) and temporal (video) data modeling
  • Visualizing and interpreting transformer networks
  • Generative models for transformer networks
  • Hybrid network designs combining the strengths of transformer models with convolutional and graph-based models
  • Unsupervised, weakly supervised, and semi-supervised learning with transformer models
  • Multi-modal learning combining visual data with text, speech, and knowledge graphs
  • Leveraging multi-spectral data like satellite imagery and infrared images in transformer models for improved semantic understanding of visual content
  • Transformer-based designs for low-level vision problems such as image super-resolution, deblurring, de-raining, and denoising
  • Novel transformer-based methods for high-level vision problems such as object detection, segmentation, activity recognition, and pose estimation
  • Transformer models for volumetric, mesh, and point-cloud data processing in 3D and 4D data regimes

Important Dates

Open for submissions: 15 October 2021
Submissions due: 15 January 2022
Preliminary notification: 15 March 2022
Revisions due: 15 May 2022
Final notification: 30 June 2022
Publication (tentative): November 2022

Submission Guidelines

For author information and guidelines on submission criteria, visit the Author Information page. Please submit papers through the ScholarOne system, and be sure to select the special-issue name. Manuscripts must not have been published or be currently under consideration for publication elsewhere. Please submit only full papers intended for review, not abstracts.


Contact the guest editors:

  • Ashish Vaswani, Research Scientist, Google Brain (USA)
  • Fahad Shahbaz Khan, Associate Professor, Mohamed Bin Zayed University of Artificial Intelligence (UAE); Linköping University (Sweden)
  • Ming-Hsuan Yang, Professor, University of California, Merced (USA); Research Scientist, Google (USA)
  • Mubarak Shah, Trustee Chair Professor, University of Central Florida (USA)
  • Niki Parmar, Research Scientist, Google Research (USA)
  • Salman Khan, Assistant Professor, Mohamed Bin Zayed University of Artificial Intelligence (UAE); Australian National University (Australia)