META's Hiera: Streamlining Complexity for Enhanced Accuracy

Chapter 1 Understanding Vision Transformers

For over two decades, convolutional networks have been the gold standard in computer vision. However, the introduction of transformers led many to speculate that CNNs would be rendered obsolete. Yet, convolution-based models continue to be utilized in various projects. Why is this the case?

This article aims to address the following questions: What exactly are Vision Transformers? What limitations do they face? Can these limitations be addressed? How does META's Hiera manage to excel?

Section 1.1 The Rise of Vision Transformers

Vision Transformers (ViTs) have emerged as leaders in recent vision benchmarks. But what are they really? Until a few years ago, convolutional neural networks (CNNs) were the primary choice for visual tasks. The release of the transformer model in 2017 revolutionized the NLP field, demonstrating that self-attention models could outperform traditional RNNs and LSTMs. This led to a natural inquiry: could transformers also be applied effectively to images?

Prior to 2020, attempts to integrate self-attention into vision models had shown limited success, and researchers were looking for a way to apply transformers to images natively. In 2020, Google proposed the Vision Transformer (ViT): split an image into fixed-size patches and treat them as a sequence, just as a transformer treats tokens in text.
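
To make this concrete, here is a minimal sketch of ViT-style patch embedding in PyTorch. The class name, image size, patch size, and embedding width are illustrative assumptions, not taken from any particular implementation.

    # A minimal sketch of ViT-style patch embedding; all sizes are assumptions.
    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2
            # A strided convolution cuts the image into non-overlapping patches
            # and linearly projects each one to an embedding vector in one step.
            self.proj = nn.Conv2d(in_channels, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                        # x: (B, 3, 224, 224)
            x = self.proj(x)                         # (B, 768, 14, 14)
            x = x.flatten(2).transpose(1, 2)         # (B, 196, 768): a token sequence
            return x

    tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
    print(tokens.shape)                              # torch.Size([1, 196, 768])

From here on, the 196 patch tokens are processed exactly like word tokens in an NLP transformer.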

Section 1.2 Advantages of Vision Transformers

As CNNs gradually lose their dominance in computer vision, Vision Transformers are proving their worth on benchmarks like ImageNet. Given sufficient training data, ViTs have been shown to outperform CNNs. Despite their architectural differences, both families learn rich, layered representations; ViTs, however, make better use of background and global context and tend to be more robust.

Another significant advantage of transformers is their scalability. Over the years we have seen CNNs with millions of parameters, while ViTs have reached billions. Recently, Google showed that ViTs can be scaled to 22 billion parameters (ViT-22B), hinting at even larger models in the future.

Chapter 2 The Limitations of Vision Transformers

While ViTs have made impressive strides, they are not without their challenges. The inherent design of transformers leads to inefficient use of parameters due to consistent spatial resolution and channel numbers throughout the network.

CNNs, loosely inspired by the human visual cortex, progressively reduce spatial resolution and increase the number of channels as the network gets deeper. Plain transformers, by contrast, stack identical self-attention blocks that keep the token grid and width fixed throughout; this generalizes well, but it becomes a handicap for image tasks such as object detection, where objects appear at widely varying scales.
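
The contrast is easiest to see by tracing shapes. The snippet below is purely illustrative: the stage sizes are typical values assumed for a ResNet-style backbone and a ViT-Base-style transformer, not figures from any specific paper.

    # Illustrative shape trace: a CNN feature pyramid versus the constant token
    # grid of a plain ViT. All sizes below are assumptions.
    cnn_stages = [(56, 56, 64), (28, 28, 128), (14, 14, 256), (7, 7, 512)]
    vit_blocks = [(14, 14, 768)] * 12   # every block sees the same grid and width

    for i, (h, w, c) in enumerate(cnn_stages, start=1):
        print(f"CNN stage {i}: {h}x{w} spatial, {c} channels")
    h, w, c = vit_blocks[0]
    print(f"ViT: {len(vit_blocks)} blocks, all at {h}x{w} tokens, {c} dims")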

Efforts to address these issues include leveraging hierarchical feature maps, as seen in the Swin Transformer, which builds a hierarchical representation starting from small patches and progressively merging them.
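
Here is a minimal PyTorch sketch of the patch-merging step that gives Swin its pyramid. It follows the idea described in the Swin paper (concatenate each 2x2 group of neighbouring tokens and project it down), but the exact layout and sizes are a simplification of my own.

    # Swin-style patch merging (simplified): concatenate each 2x2 group of
    # neighbouring tokens and project, halving resolution and doubling width.
    import torch
    import torch.nn as nn

    class PatchMerging(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.norm = nn.LayerNorm(4 * dim)
            self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

        def forward(self, x):                       # x: (B, H, W, C), H and W even
            x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                           x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
            return self.reduction(self.norm(x))     # (B, H/2, W/2, 2C)

    merged = PatchMerging(dim=96)(torch.randn(1, 56, 56, 96))
    print(merged.shape)                             # torch.Size([1, 28, 28, 192])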

Section 2.1 Innovations in Vision Transformers

Complex adjustments have been proposed over time to enhance ViTs' performance, but these often slow down training. Can we tackle transformer limitations without resorting to intricate solutions? Recent advancements have focused on simplifying models and expediting training, with the introduction of sparsity being a notable method. One successful model in computer vision utilizing this is the Masked Autoencoder (MAE).

In this approach, the image is split into patches and a large fraction of them (around 75% in the original MAE) is masked out. The encoder processes only the visible patches, and a lightweight decoder is tasked with reconstructing the original image from the encoded tokens. Because the encoder sees only a fraction of the patches, very large encoders can be pre-trained with modest compute and memory requirements.
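
A minimal sketch of MAE-style random masking in PyTorch follows. The function name and mask ratio are illustrative, and the sketch omits the mask tokens and positional details a real pipeline needs; it only shows how the encoder ends up seeing a small subset of the patches.

    # MAE-style random masking (sketch): keep a random 25% of the patch tokens
    # and pass only those to the encoder.
    import torch

    def random_masking(tokens, mask_ratio=0.75):
        B, N, D = tokens.shape
        num_keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N)                        # one random score per patch
        keep_idx = noise.argsort(dim=1)[:, :num_keep]   # indices of visible patches
        visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        return visible, keep_idx                        # the encoder sees only `visible`

    tokens = torch.randn(1, 196, 768)                   # 14x14 grid of patch tokens
    visible, keep_idx = random_masking(tokens)
    print(visible.shape)                                # torch.Size([1, 49, 768])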

Subsection 2.1.1 Addressing the Complexity Challenge

However, while sparse masking makes training far more efficient, it sits uneasily with the hierarchical design that gives CNNs their advantage: masking removes tokens, whereas hierarchical models rely on a dense, regular 2D grid to pool and merge neighbouring features. Previous models that tried to combine the two ideas ended up slow or burdened with extra complexity.

Is it feasible to create a model that is both sparse and hierarchical yet efficient?

Chapter 3 Introducing Hiera: A Game-Changer

META's latest work takes a different route from earlier attempts: rather than bolting more machinery onto MAE training, it aims to construct a hierarchical Vision Transformer that is both efficient and accurate without the complexities of previous models.

The foundational concept is that achieving high accuracy in visual tasks with a hierarchical ViT does not require a multitude of intricate components. The authors assert that spatial relationships can be effectively learned through MAE training.

Starting from an existing hierarchical ViT, MViTv2, the team repurposed it for MAE training, removing non-essential complexities and introducing a few simple improvements along the way.
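
The following is not Hiera's actual code, but a toy sketch of what the resulting recipe looks like: a pyramid of plain transformer stages whose resolution shrinks and width grows, which would then be pre-trained with MAE-style masking as described above. All depths, widths, and the pooling choice are assumptions made for illustration.

    # Toy hierarchical encoder: plain transformer stages with 2x2 pooling between
    # them. Sizes are illustrative, not Hiera's configuration.
    import torch
    import torch.nn as nn

    stage_spec = [(1, 96, 56), (2, 192, 28), (3, 384, 14), (1, 768, 7)]  # (depth, width, grid side)

    class ToyPyramidEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.stages, in_dim = nn.ModuleList(), 96
            for depth, dim, _ in stage_spec:
                layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
                self.stages.append(nn.ModuleDict({
                    "proj": nn.Linear(in_dim, dim),
                    "blocks": nn.TransformerEncoder(layer, num_layers=depth),
                }))
                in_dim = dim

        def forward(self, x):                             # x: (B, 56*56, 96) patch tokens
            for i, (stage, (_, dim, side)) in enumerate(zip(self.stages, stage_spec)):
                x = stage["blocks"](stage["proj"](x))
                if i < len(stage_spec) - 1:                # pool the token grid between stages
                    grid = x.transpose(1, 2).reshape(x.size(0), dim, side, side)
                    x = nn.functional.max_pool2d(grid, 2).flatten(2).transpose(1, 2)
            return x

    out = ToyPyramidEncoder()(torch.randn(1, 56 * 56, 96))
    print(out.shape)                                       # torch.Size([1, 49, 768])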

[Figure: simplified architecture of the Hiera model]

The authors demonstrate that the modifications lead to significant improvements in both accuracy and processing speed. Hiera is reported to be 2.4 times faster for images and 5.1 times faster for video processing compared to its predecessor MViTv2, while also achieving greater accuracy.

Results indicate that even with a limited parameter count, Hiera performs admirably on key benchmarks such as ImageNet-1K. This reinforces the idea that spatial biases can be learned during training, making ViTs competitive against convolutional networks even at smaller scales.

The authors further validate Hiera's effectiveness through transfer learning on datasets such as iNaturalist and Places, where it outperforms previous ViTs.

Parting Thoughts

This research highlights the potential of a simplified hierarchical vision transformer, which eliminates unnecessary complexities while enhancing speed and accuracy for both image and video tasks. With many in the community still relying on convolutional models, Hiera's efficiency could be transformative.

The broader shift toward exploiting sparsity during training, which cuts computation and memory while speeding up processing, is also a growing trend across AI domains, suggesting exciting avenues for future exploration.

If you found this topic intriguing, you may wish to explore my GitHub repository, which will include a variety of resources related to machine learning and artificial intelligence.

References

  • Chaitanya Ryali et al., 2023, Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles.
  • Peng Gao et al., 2022, MCMAE: Masked Convolution Meets Masked Autoencoders.
  • Xiang Li et al., 2022, Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality.
  • Zhenda Xie et al., 2022, SimMIM: A Simple Framework for Masked Image Modeling.
  • Ze Liu et al., 2021, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.
  • Haoqi Fan et al., 2021, Multiscale Vision Transformers.
  • Kaiming He et al., 2021, Masked Autoencoders Are Scalable Vision Learners.
  • Chen Wei et al., 2021, Masked Feature Prediction for Self-Supervised Visual Pre-Training.
  • Alexey Dosovitskiy et al., 2020, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
  • Ashish Vaswani et al., 2017, Attention Is All You Need.
  • Kaiming He et al., 2015, Deep Residual Learning for Image Recognition.
  • Wei Yu et al., 2014, Visualizing and Comparing Convolutional Neural Networks.
  • Karen Simonyan et al., 2014, Very Deep Convolutional Networks for Large-Scale Image Recognition.
  • Why Do We Have Huge Language Models and Small Vision Transformers?, Towards Data Science.
  • A Visual Journey in What Vision-Transformers See, Towards AI.
  • Vision Transformer, Papers with Code.
