FAM: Improving Columnar Vision Transformer with Feature Attention Mechanism

Lan Huang, Xingyu Bai, Jia Zeng, Mengqiang Yu, Wei Pang, Kangping Wang

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)
20 Downloads (Pure)

Abstract

The Vision Transformer has achieved outstanding performance in visual tasks owing to its capability for global modeling of image information. However, during the self-attention computation over image tokens, attention maps commonly become homogenized, and as these maps propagate through the feature maps layer by layer, this degrades the model's final performance. In this research, we propose a token-based approach that adjusts the output of the attention sub-layer along the feature dimensions to address the homogenization problem. Furthermore, different network architectures model image features in different ways: Vision Transformers excel at modeling long-range relationships, while convolutional neural networks possess local receptive fields. This paper therefore introduces a plug-and-play component based on convolutional operators, integrated into the Vision Transformer, to validate the impact of structural enhancements on model performance. Experimental results on image recognition and adversarial-attack tasks demonstrate the effectiveness and robustness, respectively, of the two proposed methods. Additionally, an information-entropy analysis of the feature maps in the model's final layer shows that the improved model exhibits higher information richness, making it more conducive to the classifier's discriminative capability.
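The abstract describes two additions but not their exact formulations. Below is a minimal NumPy sketch of the general ideas under stated assumptions: `channel_gate` assumes the feature-dimension adjustment is a learned per-channel gate (in the spirit of squeeze-and-excitation) applied to the attention sub-layer's output, and `depthwise_local_mix` assumes the convolutional component is a per-channel local mixing of tokens. The weight names (`w1`, `w2`, `kernel`) are hypothetical, not from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_gate(attn_out, w1, w2):
    """Rescale the attention sub-layer output per feature channel.

    attn_out: (num_tokens, dim) output of self-attention.
    w1: (bottleneck, dim), w2: (dim, bottleneck) -- hypothetical learned
    weights of a two-layer bottleneck that produces per-channel gates.
    """
    s = attn_out.mean(axis=0)            # pool over tokens -> (dim,)
    g = sigmoid(w2 @ np.tanh(w1 @ s))    # per-channel weights in (0, 1)
    return attn_out * g                  # broadcast gate over all tokens


def depthwise_local_mix(tokens, kernel):
    """Per-channel local mixing of the token sequence (depthwise conv).

    tokens: (num_tokens, dim); kernel: (k, dim) -- one 1-D filter per
    channel (hypothetical), giving the local receptive field that a
    plain Transformer block lacks.
    """
    n, _ = tokens.shape
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(tokens, ((pad, pad), (0, 0)))
    out = np.zeros_like(tokens)
    for i in range(n):
        out[i] = (padded[i:i + k] * kernel).sum(axis=0)
    return out
```

Because the gate lies in (0, 1), each channel of the attention output is attenuated by a different amount, which counteracts the tendency of token features to collapse toward a common direction; the depthwise branch adds local structure without mixing channels.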
Original language: English
Article number: 103981
Journal: Computer Vision and Image Understanding
Volume: 242
Early online date: 6 Mar 2024
DOIs
Publication status: Published - May 2024

Keywords

  • Feature adjustment
  • Network structure improvement
  • Vision transformer

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
