ViTs (Vision Transformers)

A class of deep learning models that applies the transformer architecture, originally designed for natural language processing, to computer vision tasks.

Vision Transformers adapt the transformer architecture to image analysis. Instead of relying on the convolutional layers typically used in image processing, a ViT divides an image into fixed-size patches and treats them as a sequence of tokens, much like words in a sentence. This lets the model apply the transformer's self-attention mechanism, so every patch can attend to every other patch and the network can focus on the most informative regions of the image. ViTs are particularly noted for their scalability: given larger datasets and more computational power they continue to improve, achieving state-of-the-art results on many benchmark image recognition tasks.
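To make the patch-and-attend idea concrete, the sketch below shows a minimal ViT-style model in PyTorch. The class name MiniViT and the hyperparameters (16x16 patches, embedding dimension 192, 4 encoder layers, 3 attention heads) are illustrative assumptions, not the configuration or code from the original paper, and details such as dropout and the paper's weight initialization are omitted.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style sketch: patchify, embed, add a class token,
    run a transformer encoder, and classify from the class token."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution splits the image into non-overlapping
        # patch_size x patch_size patches and linearly projects each one.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable class token and positional embeddings (zero init keeps
        # the sketch short; the paper initializes these differently).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                # (B, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)               # self-attention over all patches
        return self.head(tokens[:, 0])              # classify from the class token

logits = MiniViT()(torch.randn(2, 3, 224, 224))     # -> shape (2, 1000)
```

The strided convolution is a common shorthand for "split the image into 16x16 patches and linearly project each one"; the transformer encoder then mixes information across all patch tokens via self-attention, and the prepended class token aggregates the result for classification.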

Historical Overview: The concept of Vision Transformers was introduced in 2020 with a paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by researchers at Google. The model quickly gained popularity within the AI community for its innovative approach and strong performance on standard image recognition benchmarks.

Key Contributors: The development of Vision Transformers was led by Alexey Dosovitskiy and his team at Google Research. This group played a crucial role in demonstrating the effectiveness of the transformer architecture outside its original application in natural language processing, paving the way for further research and application in various fields of computer vision.