Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Tags: Swin Transformer
Publication date: 2021
Rating: ★★★★★
Model abbreviation: Swin Transformer
Summary: A hierarchical Vision Transformer built on window-based (shifted-window) multi-head self-attention; each pair of blocks applies W-MSA first, then SW-MSA.
In-depth read: Yes

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV 2021 Best Paper)

A perfect combination of Transformer and CNN

Motivation: a unification story for AI (NLP and CV), pursuing a unified architecture (ViT is more suitable for that goal)

Unification: graph neural networks, self-attention

Unified modeling of NLP and CV

Transformer: graph-based modeling

Combining general representations with domain knowledge

ViT: brute-force scale works wonders

Building on ViT while incorporating CV characteristics (good priors for visual signals):

Hierarchy, locality, translation invariance

Introduction

General-purpose vision backbone with multi-scale features; multi-scale features are crucial for dense prediction tasks.

Linear computational complexity: The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.

Locality: shifted windows, local self-attention.

Hierarchy: Patch Merging.

Shifted windows:

Cross-window connection: local self-attention effectively approximates global self-attention.

Model Architecture

4 stages

Patch Partition: split the image into patches, 224x224x3 → 56x56x48 (downsampling rate 4).

Linear Embedding: project the vector dimension to a size the Transformer can accept, 56x56x48 → 56x56x96 → 3136x96, with C=96. Patch Partition + Linear Embedding == Patch Projection in ViT.
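In practice these two steps can be fused into a single strided convolution. A minimal PyTorch sketch (the module name `PatchEmbed` and its defaults are illustrative, not taken from the paper), assuming a 224x224 RGB input:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch Partition + Linear Embedding fused into one strided convolution (sketch)."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A 4x4 conv with stride 4 cuts the image into non-overlapping 4x4 patches
        # (each a 4*4*3 = 48-dim vector) and projects them to embed_dim = 96.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                  # x: (B, 3, 224, 224)
        x = self.proj(x)                   # (B, 96, 56, 56)
        x = x.flatten(2).transpose(1, 2)   # (B, 3136, 96) = (B, H*W, C)
        return x

print(PatchEmbed()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 3136, 96])
```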

Patch Merging: 2x downsampling that trades spatial resolution for channels, HxWxC → H/2 x W/2 x 4C → H/2 x W/2 x 2C; the spatial size halves while the channel count doubles, mirroring convolutional networks. 56x56x96 → 28x28x192. Self-attention + Patch Merging == CNN + Pooling.
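A minimal PyTorch sketch of this step (class name and interface are illustrative): each 2x2 neighbourhood is concatenated along the channel axis (C → 4C) and then linearly reduced to 2C:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """H x W x C -> H/2 x W/2 x 4C -> H/2 x W/2 x 2C (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                       # x: (B, H*W, C)
        B, _, C = x.shape
        x = x.view(B, H, W, C)
        # Take the 4 members of every 2x2 group and stack them on the channel axis.
        x = torch.cat([x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :],
                       x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]], dim=-1)
        x = x.view(B, -1, 4 * C)                      # (B, H/2*W/2, 4C)
        return self.reduction(self.norm(x))           # (B, H/2*W/2, 2C)

print(PatchMerging(96)(torch.randn(1, 56 * 56, 96), 56, 56).shape)  # torch.Size([1, 784, 192])
```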

Architecture Variants:

Swin-T: C=96, layer numbers = {2, 2, 6, 2} (Tiny; complexity comparable to ResNet-50)

Swin-S: C=96, layer numbers = {2, 2, 18, 2} (Small; complexity comparable to ResNet-101)

Swin-B: C=128, layer numbers = {2, 2, 18, 2} (Base)

Swin-L: C=192, layer numbers = {2, 2, 18, 2} (Large)
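For reference, the variants above written as a small configuration table (a sketch; the dictionary name is illustrative):

```python
# Channel width C and per-stage block counts for the four variants (sketch).
SWIN_VARIANTS = {
    "Swin-T": dict(embed_dim=96,  depths=(2, 2, 6, 2)),
    "Swin-S": dict(embed_dim=96,  depths=(2, 2, 18, 2)),
    "Swin-B": dict(embed_dim=128, depths=(2, 2, 18, 2)),
    "Swin-L": dict(embed_dim=192, depths=(2, 2, 18, 2)),
}

for name, cfg in SWIN_VARIANTS.items():
    print(name, cfg["embed_dim"], cfg["depths"], "total blocks:", sum(cfg["depths"]))
```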

Shifted Window based Self-Attention

The global computation leads to quadratic complexity with respect to the number of tokens.

Compute self-attention within local windows.

$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$$

$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$$

Window-based self-attention has much lower computational complexity than global self-attention, but it loses the ability to model global context.
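A quick numeric check of the two formulas above, plugging in the stage-1 resolution of Swin-T (h = w = 56, C = 96) and the default window size M = 7:

```python
# FLOP estimates from the complexity formulas above.
def omega_msa(h, w, C):          # global MSA: quadratic in the token count h*w
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def omega_w_msa(h, w, C, M):     # window MSA: linear in h*w for a fixed window size M
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 56; C = 96; M = 7
print(f"MSA  : {omega_msa(h, w, C):.3e}")       # ~2.0e9
print(f"W-MSA: {omega_w_msa(h, w, C, M):.3e}")  # ~1.5e8, roughly 14x cheaper at this resolution
```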

Shifted window partitioning in successive blocks

$$\hat{z}^l = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}$$

$$z^l = \text{MLP}(\text{LN}(\hat{z}^l)) + \hat{z}^l$$

$$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^l)) + z^l$$

$$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$$
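The four equations are a pre-norm residual pattern. A schematic PyTorch sketch (the class name `SwinBlockPair` is illustrative, and `nn.Identity()` stands in for the real W-MSA/SW-MSA and MLP sub-layers, which are omitted here):

```python
import torch
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Two consecutive blocks: a W-MSA block followed by an SW-MSA block (sketch)."""
    def __init__(self, dim, w_msa, sw_msa, mlp1, mlp2):
        super().__init__()
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.w_msa, self.sw_msa = w_msa, sw_msa
        self.mlp1, self.mlp2 = mlp1, mlp2

    def forward(self, z):
        z_hat = self.w_msa(self.norm[0](z)) + z          # z_hat^l   = W-MSA(LN(z^{l-1})) + z^{l-1}
        z     = self.mlp1(self.norm[1](z_hat)) + z_hat   # z^l       = MLP(LN(z_hat^l)) + z_hat^l
        z_hat = self.sw_msa(self.norm[2](z)) + z         # z_hat^l+1 = SW-MSA(LN(z^l)) + z^l
        z     = self.mlp2(self.norm[3](z_hat)) + z_hat   # z^l+1     = MLP(LN(z_hat^l+1)) + z_hat^l+1
        return z

blocks = SwinBlockPair(96, nn.Identity(), nn.Identity(), nn.Identity(), nn.Identity())
print(blocks(torch.randn(1, 56 * 56, 96)).shape)  # torch.Size([1, 3136, 96])
```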

Efficient batch computation approach for self-attention in shifted window partitioning

Mask in shifted window attention (Masked MSA, masked multi-head self-attention); the shifted window pieces look a bit like a tangram puzzle.

Mask visualization: (the mask is added to the self-attention scores)
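A sketch of how such a mask can be built (following the cyclic-shift idea above; the function name is illustrative): label the regions created by the shift, partition the label map into windows, and block attention between positions whose labels differ:

```python
import torch

def shifted_window_attn_mask(H, W, window_size, shift_size):
    """Attention mask for SW-MSA with cyclic shift (sketch). After torch.roll,
    a window near the border can contain pixels from regions that were not
    adjacent in the original image; those pairs get -100 (effectively -inf)
    added to their attention logits so softmax ignores them."""
    img_mask = torch.zeros(1, H, W, 1)
    slices = (slice(0, -window_size),
              slice(-window_size, -shift_size),
              slice(-shift_size, None))
    cnt = 0
    for h in slices:                      # label the 3x3 regions produced by the shift
        for w in slices:
            img_mask[:, h, w, :] = cnt
            cnt += 1
    # Window-partition the label map, then compare labels pairwise inside each window.
    m = img_mask.view(1, H // window_size, window_size, W // window_size, window_size, 1)
    m = m.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size)
    attn_mask = m.unsqueeze(1) - m.unsqueeze(2)           # (num_windows, M*M, M*M)
    return attn_mask.masked_fill(attn_mask != 0, -100.0)

mask = shifted_window_attn_mask(H=56, W=56, window_size=7, shift_size=3)
print(mask.shape)  # torch.Size([64, 49, 49]); added to the attention scores before softmax
```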


Experiments

Pretrained on:

ImageNet-1K: 1.28M images, 1k classes

ImageNet-22K: 14.2M images, 22k classes

3 datasets to cover various recognition tasks of different granularities

Image-level: ImageNet-1K classification (1.28 million images; 1,000 classes)

Region-level: COCO object detection (115K images, 80 classes)

Pixel-level: ADE20K semantic segmentation (20K images; 150 classes)

3 levels of comparison

System-level comparisons (not aiming for strictly fair comparison; chasing peak performance, MMA-style)

Backbone-level comparison

Verify the effectiveness of crucial designs

The experiments section is top-tier, comprehensively outperforming previous models.

Taken to the extreme, unification can even mean that two modalities share model parameters rather than merely sharing the same architecture. Usually it does not have to go that far; for example, when processing raw signals, the first few layers are typically not shared. In most applications, adopting Transformer-style modules the way Swin does is already enough to unify the training recipes of the two modalities and let them borrow experience from each other, which is already very good. Conversely, some of the properties in Swin can be applied back to NLP, which is another way to achieve the unification mentioned above.

Swin Transformer V2

Swin Transformer V2: Scaling Up Capacity and Resolution

  1. Under pre-norm, activation magnitudes diverge as the model gets deeper, creating a large gap from shallow-layer features and leading to training instability.
  2. Cosine similarity replaces dot-product similarity, mitigating cases where a few features with large values dominate attention, since the cosine is inherently a normalized quantity.
  3. Log-spaced CPB helps when scaling up the window size.

Three improvements over Swin Transformer:

  1. Post-normalization: layer normalization is applied after the self-attention layer and the MLP block (see the sketch after this list).
  2. Scaled cosine attention replaces dot-product attention, using cosine similarity to compute the relation between token pairs.
  3. Log-spaced continuous position bias (CPB).
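A minimal sketch of improvement 1, contrasting the pre-norm residual unit of V1 with the res-post-norm unit of V2 (the `nn.Linear` stand-in for the attention/MLP sub-layer is illustrative):

```python
import torch
import torch.nn as nn

dim = 96
norm = nn.LayerNorm(dim)
f = nn.Linear(dim, dim)      # stand-in for the attention or MLP sub-layer

def pre_norm_unit(x):        # Swin V1: x + f(LN(x)); branch outputs can grow with depth
    return x + f(norm(x))

def res_post_norm_unit(x):   # Swin V2: x + LN(f(x)); branch output is normalized before the add
    return x + norm(f(x))

x = torch.randn(1, 49, dim)
print(pre_norm_unit(x).shape, res_post_norm_unit(x).shape)  # both torch.Size([1, 49, 96])
```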

Scaled cosine attention:

$$\text{Sim}(q_i, k_j) = \frac{\cos(q_i, k_j)}{\tau} + B_{ij}$$
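A minimal sketch of this score, assuming a per-head learnable τ (clamped to be larger than 0.01, as in the paper) and an externally supplied relative position bias B; the function name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, tau, bias):
    """Sim(q_i, k_j) = cos(q_i, k_j) / tau + B_ij (sketch).
    q, k, v: (heads, N, d); tau: (heads,), clamped above 0.01;
    bias: (heads, N, N) relative position bias from the CPB network."""
    sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)  # cosine similarity
    attn = sim / tau.clamp(min=0.01).view(-1, 1, 1) + bias
    return attn.softmax(dim=-1) @ v

heads, N, d = 3, 49, 32
q, k, v = torch.randn(3, heads, N, d).unbind(0)
out = scaled_cosine_attention(q, k, v, tau=torch.full((heads,), 0.1),
                              bias=torch.zeros(heads, N, N))
print(out.shape)  # torch.Size([3, 49, 32])
```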

Log-spaced CPB

Motivation: performance degrades when transferring models across window resolutions. When the window size is enlarged for a bigger image and the relative position bias is extended by bi-cubic interpolation, accuracy drops significantly.

Continuous relative position bias:

$$B(\Delta x, \Delta y) = \varphi(\Delta x, \Delta y)$$

where φ is a small meta-network (e.g., a 2-layer MLP) applied to the relative coordinates.

$$\widehat{\Delta x} = \text{sign}(\Delta x) \cdot \log(1 + |\Delta x|)$$

$$\widehat{\Delta y} = \text{sign}(\Delta y) \cdot \log(1 + |\Delta y|)$$
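A small sketch of the effect of this transform on 1-D relative offsets (the function name is illustrative): for an 8x8 window the raw offsets span [-7, 7], while the log-spaced offsets stay within about [-2.08, 2.08], so extrapolating to a larger window extends the input range far less:

```python
import torch

def log_spaced_offsets(window_size):
    """Delta_hat = sign(Delta) * log(1 + |Delta|), applied to 1-D relative offsets (sketch)."""
    coords = torch.arange(window_size)
    delta = (coords[None, :] - coords[:, None]).float()        # raw offsets in [-(M-1), M-1]
    delta_hat = torch.sign(delta) * torch.log1p(delta.abs())   # log-spaced offsets
    return delta, delta_hat

delta, delta_hat = log_spaced_offsets(8)
print(delta.abs().max().item(), round(delta_hat.abs().max().item(), 3))  # 7.0 2.079
```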
