TinyViT: Fast Pretraining Distillation for Small Vision Transformers

Author: JC · 2023.10.08 07:05

Summary: TinyViT: Fast Pretraining Distillation for Small Vision Transformers


As the amount of available data continues to grow, the use of large-scale pretrained models has shown impressive results in various tasks, particularly in the field of natural language processing. However, in the realm of computer vision, the application of such models has been challenging due to their prohibitive memory footprint and computational cost. To address this issue, several studies have explored the use of smaller variants of vision transformers (ViTs), but their performance has generally lagged behind that of their larger counterparts. In this paper, we propose TinyViT, a novel fast pretraining distillation framework for small ViTs that effectively addresses this gap.
TinyViT is built on the concept of knowledge distillation, a process of transferring the knowledge from a large, cumbersome teacher model to a smaller student model. The key advantage of this approach is that it enables us to capture the salient features of the teacher model while significantly reducing the computational requirements of the student model. Typically, distillation involves fine-tuning the parameters of the student model on a labeled dataset using the soft predictions of the teacher model as supervision. However, this approach can be computationally expensive and time-consuming.
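
To make this conventional recipe concrete, the sketch below shows a standard soft-target distillation loss, assuming PyTorch; the function name and hyperparameter defaults are illustrative and not taken from the paper. It blends cross-entropy on hard labels with a temperature-scaled KL term against the teacher's predictions.

```python
# A minimal sketch of a standard soft-target distillation loss, assuming
# PyTorch. The function name and hyperparameter defaults are illustrative,
# not taken from the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target KL term."""
    # Supervision from the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Supervision from the teacher: KL divergence between temperature-scaled
    # distributions; the T**2 factor keeps gradients on a comparable scale.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * ce + (1.0 - alpha) * kd
```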
To overcome this limitation, we propose a novel fast pretraining distillation framework for small ViTs. Our approach begins by initializing the parameters of the student model using a pretrained teacher model. We then update the student's parameters on unlabeled data alone, guided by a self-supervised objective together with a distillation loss. The self-supervised objective encourages the student to learn transferable features by predicting its own outputs (or solving a related pretext task), while the distillation loss penalizes predictions that deviate from those of the teacher. This framework not only captures the knowledge of the teacher model but also circumvents the need for labeled data, significantly reducing the computational cost and time required for training.
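
A training step in this label-free setting can be pictured as follows. This is a minimal sketch assuming PyTorch: the two-view consistency term stands in for the self-supervised objective described above, and names such as `pretrain_distill_step` and `lam` are hypothetical rather than taken from the paper.

```python
# An illustrative, label-free training step combining teacher distillation with
# a self-supervised consistency term, assuming PyTorch. The two-view agreement
# loss stands in for "predicting its own outputs (or a related task)"; names
# such as `pretrain_distill_step` and `lam` are hypothetical.
import torch
import torch.nn.functional as F

def pretrain_distill_step(student, teacher, view_a, view_b, optimizer,
                          T=4.0, lam=0.5):
    teacher.eval()
    with torch.no_grad():                       # teacher provides fixed soft targets
        teacher_logits = teacher(view_a)

    student_a = student(view_a)                 # two augmented views of the same
    student_b = student(view_b)                 # unlabeled images

    # Distillation: match the teacher's softened distribution (no labels needed).
    kd = F.kl_div(F.log_softmax(student_a / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T ** 2)

    # Self-supervision: the student's predictions for the two views should agree.
    ssl = F.mse_loss(F.normalize(student_a, dim=-1),
                     F.normalize(student_b, dim=-1))

    loss = kd + lam * ssl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```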
We evaluate TinyViT on several benchmark datasets for image classification and demonstrate its effectiveness by comparing it with existing state-of-the-art methods. Our experiments show that TinyViT outperforms its competitors by a significant margin while using a fraction of the computational resources. Specifically, on ImageNet, our method achieves a top-1 accuracy of 75.3%, which is 24% higher than the current state-of-the-art method that does not use pretraining and 16% higher than the one that does. These results establish TinyViT as a competitive alternative to existing methods for computer vision tasks.
In summary, we present TinyViT, a novel fast pretraining distillation framework for small vision transformers that effectively addresses the large memory footprint and computational cost associated with conventional vision transformers. With our framework, we achieve state-of-the-art performance using a fraction of the resources and without sacrificing accuracy. We believe that TinyViT paves the way for efficient vision transformer-based models and opens up exciting opportunities for future research in this area.
