BRECQ: PUSHING THE LIMIT OF POST-TRAINING QUANTIZATION BY BLOCK RECONSTRUCTION

1. INTRODUCTION

  1. Based on the second-order analysis, we define a set of reconstruction units and show that block reconstruction is the best choice, with support from theoretical and empirical evidence. We also use the Fisher Information Matrix to assign each pre-activation an importance measure during reconstruction.
  2. We incorporate a genetic algorithm and the well-defined intra-block sensitivity measure to generate latency- and size-guaranteed mixed-precision quantized neural networks, which yields consistent improvements on both specialized hardware (FPGA) and general hardware (ARM CPU).
  3. We conduct extensive experiments to verify the proposed methods and find that they are applicable to a large variety of tasks and models. Moreover, we show for the first time that post-training quantization can quantize weights to INT2 without significant accuracy loss.

2 PRELIMINARIES

3 METHOD

3.1 CROSS-LAYER DEPENDENCY

3.2 BLOCK RECONSTRUCTION

paragraph 1

Problem raised: Although network-output reconstruction gives an accurate estimate of the second-order error, we find in practice it is worse than layer-by-layer reconstruction in PTQ.

Reason: The primary reason is that optimizing the whole network over only 1024 calibration samples easily leads to over-fitting.

Further explanation, citing related work: As Jakubovitz et al. (2019) explained, networks can have perfect expressivity when the number of parameters exceeds the number of training samples, but lower training error does not ensure lower test error.

Analogy for layer-wise reconstruction, which suggests it may have its own limitation (not stated explicitly in the paper; inferred from the transition that follows): We find layer-wise reconstruction acts like a regularizer that reduces the generalization error by matching each layer's output distribution.

Proposed direction: an intermediate granularity is needed. In other words, both layer-wise and network-wise output reconstruction have their own drawbacks, and a better bias-variance trade-off should be obtained by conducting reconstruction at an intermediate granularity.

paragraph 2

First, the two existing granularities: layer-wise optimization corresponds to a layer-diagonal Hessian (Fig. 1b, blue parts) and network-wise optimization corresponds to the full Hessian (Fig. 1b, green parts).

Building on these two, the paper proposes its own granularity: similarly, we can define an intermediate block-diagonal Hessian.

Mathematical formula showing the block-diagonal Hessian:
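
A minimal sketch of the general form of such an approximation, assuming the network is partitioned into n blocks with parameters w^(1), ..., w^(n) (this notation is illustrative, not copied from the paper):

```latex
\mathbf{H} \;\approx\;
\begin{pmatrix}
\mathbf{H}^{(1)} & \mathbf{0}       & \cdots & \mathbf{0} \\
\mathbf{0}       & \mathbf{H}^{(2)} & \cdots & \mathbf{0} \\
\vdots           & \vdots           & \ddots & \vdots     \\
\mathbf{0}       & \mathbf{0}       & \cdots & \mathbf{H}^{(n)}
\end{pmatrix},
\qquad
\mathbf{H}^{(k)} \;=\; \frac{\partial^{2}\mathcal{L}}{\partial \mathbf{w}^{(k)}\,\partial \mathbf{w}^{(k)\top}}
```

Each diagonal block H^(k) keeps the curvature among parameters of the same network block, while every cross-block entry is set to zero.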

Explanation of the formula: such a block-diagonal Hessian ignores the inter-block dependency and keeps the intra-block dependency, but it produces less generalization error. We can then reconstruct the intermediate outputs block-by-block.

paragraph 3

We define two extra kinds of intermediate reconstruction granularity, giving four options in total:

  • Layer-wise Reconstruction: Assume the Hessian matrix is layer-diagonal and optimize the layer outputs one-by-one. It does not consider cross-layer dependency and resembles existing methods.
  • Block-wise Reconstruction: A block is the core component of modern CNNs, such as the Residual Bottleneck Block shown in Fig. 1a. This method assumes the Hessian matrix is block-diagonal and performs reconstruction block-by-block, which ignores inter-block dependencies.
  • Stage-wise Reconstruction: A stage is where the feature maps are downsampled and more channels are generated, which is believed to produce higher-level features. A typical CNN for the ImageNet dataset contains 4 or 5 different stages. This method simultaneously optimizes all layers within a stage and thus considers more dependencies than the block-wise method.
  • Network-wise Reconstruction: Optimize the whole quantized network by reconstructing the output of the final layer. This method resembles distillation but does not perform well with few images because of its high generalization error.

After testing the four reconstruction granularities, block-wise optimization works best. Conjectured reason: we think this is because the main off-diagonal loss in the Hessian is concentrated within each block, as the orange part of Fig. 1b illustrates, while the inter-block loss is small and can be ignored during optimization.

Supporting evidence: the shortcut connections proposed in (He et al., 2016) may also increase the dependencies within a block.

Drawbacks of the two coarser granularities: stage-wise and network-wise optimization suffer from poor generalization on the validation set, which degrades the final performance.

Finally, why block-wise optimization is chosen (the paper does not claim it is the optimal granularity, only that it has two merits): It is necessary to point out that our analysis does not give the optimal configuration of the reconstruction granularity. The choice of block-wise optimization comes from our experiments, and this choice has two merits: (1) no hyper-parameters are introduced, and (2) it is applicable to all models and tasks we tested.
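
To make the chosen granularity concrete, here is a minimal PyTorch-style sketch of block-by-block reconstruction on a small calibration set. The names `fp_blocks` and `q_blocks` and the plain MSE objective are simplifications for illustration (BRECQ additionally weights the objective by squared gradients, see Section 3.3); this is not the authors' implementation.

```python
import torch

def block_reconstruction(fp_blocks, q_blocks, calib_data, iters=20000, lr=1e-3):
    """Sketch of block-by-block output reconstruction.

    fp_blocks / q_blocks: lists of full-precision and quantized blocks
    (hypothetical objects; each quantized block is assumed to expose the
    quantization parameters to be optimized as trainable parameters).
    calib_data: a small calibration batch, e.g. 1024 images.
    """
    x_fp = calib_data                    # running full-precision activations
    x_q = calib_data                     # running quantized activations
    for fp_blk, q_blk in zip(fp_blocks, q_blocks):
        with torch.no_grad():
            y_fp = fp_blk(x_fp)          # target: the block's full-precision output
        params = [p for p in q_blk.parameters() if p.requires_grad]
        opt = torch.optim.Adam(params, lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            y_q = q_blk(x_q)
            # Plain MSE reconstruction of the block output.
            loss = torch.nn.functional.mse_loss(y_q, y_fp)
            loss.backward()
            opt.step()
        with torch.no_grad():
            x_fp, x_q = y_fp, q_blk(x_q)  # feed the outputs to the next block
    return q_blocks
```

Feeding each quantized block the output of the previous quantized block keeps the calibration consistent with how the quantized network will actually run at inference time.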

3.3 APPROXIMATING PRE-ACTIVATION HESSIAN

paragraph 1

Advantage of the block-diagonal approximated Hessian matrix: with it, we can measure the cross-layer dependency inside each block and transform any block's second-order error into the error of this block's output.
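
A hedged reconstruction of the kind of objective referred to here, writing Δz^(k) for the quantization-induced change in block k's output and H^(z^(k)) for the Hessian of the task loss with respect to that output (symbols assumed for illustration):

```latex
\min_{\widehat{\mathbf{w}}^{(k)}}\;
\mathbb{E}\!\left[\, \Delta \mathbf{z}^{(k)\top}\, \mathbf{H}^{(\mathbf{z}^{(k)})}\, \Delta \mathbf{z}^{(k)} \right]
```

The block's contribution to the second-order loss increase is expressed through its output error, but the pre-activation Hessian H^(z^(k)) still depends on the layers after block k, which is exactly the knowledge from the rest of the network that the next sentence refers to.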

Requirement of this objective: it still requires further computation involving knowledge from the rest of the network,

One possible way is to follow Nagel et al. (2020), but this method has a limitation: it might be easy to implement but loses too much information.

paragraph 2

The paper's own method: use the diagonal Fisher Information Matrix (FIM) to replace the pre-activation Hessian.

Why this is reasonable: the FIM is equal to the negative expected Hessian of the log-likelihood function; therefore, a simple corollary is that the Hessian of the task loss becomes the FIM if the model distribution matches the true data distribution. (The paper states the formula first and then interprets it to justify the approach.)

A limitation of the method, though it is the best that can be done here: although matching the true data distribution seems impossible, this is the best we can do since the pretrained model has converged.
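
For reference, the identity being invoked (a standard definition, not quoted from the paper), with p_θ the model's predictive distribution and the expectation over labels drawn from the model:

```latex
\mathbf{F}(\theta)
= \mathbb{E}_{y \sim p_{\theta}}\!\left[\nabla_{\theta}\log p_{\theta}(y \mid x)\,
                                        \nabla_{\theta}\log p_{\theta}(y \mid x)^{\top}\right]
= -\,\mathbb{E}_{y \sim p_{\theta}}\!\left[\nabla_{\theta}^{2}\log p_{\theta}(y \mid x)\right]
```

So when the model distribution matches the true data distribution, the expected Hessian of the negative log-likelihood (i.e. the task loss) coincides with the FIM.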

paragraph 3

Based on the fact that the diagonal of the pre-activation FIM equals the squared gradient of each element, a quantity successfully used in Adam, the modified optimization objective is proposed.

Comparison with plain MSE: compared with MSE minimization, the above minimization incorporates the squared-gradient information. If an output element has a higher absolute gradient, it receives more attention during reconstruction. A similar method for pruning pre-activations was proposed in Theis et al. (2018).
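
A minimal sketch of what such a squared-gradient-weighted reconstruction loss could look like; `grad_sq` is assumed to be pre-computed element-wise squared gradients of the task loss with respect to the full-precision block output on the calibration set (the names are illustrative, not from the official implementation):

```python
import torch

def fim_weighted_mse(y_q: torch.Tensor,
                     y_fp: torch.Tensor,
                     grad_sq: torch.Tensor) -> torch.Tensor:
    """Reconstruction loss weighted by the diagonal pre-activation FIM.

    y_q     : output of the quantized block
    y_fp    : output of the full-precision block (target)
    grad_sq : element-wise squared gradients of the task loss w.r.t. y_fp,
              pre-computed on the calibration set and treated as a constant
    """
    diff = y_q - y_fp.detach()
    # Elements with larger absolute task-loss gradients dominate the objective.
    return (grad_sq.detach() * diff.pow(2)).mean()
```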

paragraph 4

BRECQ is compatible with other optimization methods.

Two existing techniques are combined in the paper: adaptive rounding (Nagel et al., 2020) is adopted for the weights and learned step size (Esser et al., 2020) for the activation step size, because they are observed to generally perform better in PTQ.

Finally, the advantages are emphasized: only a small subset (1024 samples in the experiments) of the whole training dataset is needed to calibrate the quantized model, and a quantized ResNet-18 can be obtained within 20 minutes on a single GTX 1080TI GPU.
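
For reference, a minimal sketch of a learned-step-size activation quantizer in the spirit of Esser et al. (2020); the gradient scaling of the original LSQ and the adaptive-rounding part for weights (Nagel et al., 2020) are omitted, so this is an illustration rather than BRECQ's actual implementation:

```python
import torch
import torch.nn as nn

class LearnedStepQuantizer(nn.Module):
    """LSQ-style activation fake-quantizer with a learnable step size."""

    def __init__(self, n_bits: int = 4, init_step: float = 0.1):
        super().__init__()
        self.qmin, self.qmax = 0, 2 ** n_bits - 1          # unsigned range for activations
        self.step = nn.Parameter(torch.tensor(init_step))   # learnable step size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = torch.clamp(x / self.step, self.qmin, self.qmax)
        # Straight-through estimator: rounding is bypassed in the backward pass.
        q = q + (q.round() - q).detach()
        return q * self.step
```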

3.4 MIXED PRECISION

paragraph 1

To further push the limit of post-training quantization, we employ mixed precision techniques

Then the formula is explained.
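
Schematically, the mixed-precision problem can be written as follows, with c the per-layer (or per-block) bit-width vector, L a sensitivity-based loss measure, H a hardware measurement (model size or latency), and β the budget; these symbols are chosen here for illustration, not copied from the paper:

```latex
\min_{\mathbf{c}} \; \mathcal{L}(\mathbf{c})
\quad \text{s.t.} \quad \mathcal{H}(\mathbf{c}) \le \beta,
\qquad c_i \in \{2, 4, 8\}
```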

paragraph 2

First, how existing methods handle the training loss L: nearly all existing literature (Cai et al., 2020; Hubara et al., 2020; Dong et al., 2019) uses a layer-wise measurement. They all assume the sensitivities of different layers are independent and can be summed together; therefore, the mixed-precision problem becomes an integer programming problem.

Then, the paper's objection and its reason: we argue that the loss measurement should contain two parts, a diagonal loss and an off-diagonal loss. The first is the same as in previous works and measures the sensitivity of each layer independently, while the off-diagonal loss measures the cross-layer sensitivity.

But directly examining all permutations is theoretically problematic: we would have to consider 3^n possibilities, which is prohibitively expensive for the search algorithm.

So, the first attempt: reduce the off-diagonal loss to the block level, since, as mentioned, the Hessian can be approximated by a block-diagonal matrix.

However, this still leaves a large search space: for example, if a block has four layers, we have to consider 3^4 = 81 permutations for a single block.

Because, based on preliminary experiments, 4-bit and 8-bit quantization barely drop the final accuracy, only 2-bit permutations are taken into consideration, which drastically reduces the search space.

Final solution: a genetic algorithm (Guo et al., 2020) is used to search for the optimal bit-width configuration under a hardware performance threshold.
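
A toy sketch of how such a genetic search over per-block bit-widths might look, assuming a precomputed block-wise sensitivity table; the names `sensitivity`, `params_per_block`, and `size_budget` are hypothetical, and this is not the actual implementation of Guo et al. (2020) or BRECQ:

```python
import random

BITS = (2, 4, 8)  # candidate weight bit-widths per block

def model_size(cfg, params_per_block):
    """Model size in bits for a configuration (one bit-width per block)."""
    return sum(b * n for b, n in zip(cfg, params_per_block))

def evolve(sensitivity, params_per_block, size_budget,
           pop_size=50, generations=100, mutate_p=0.1):
    """Toy genetic search over per-block bit-widths under a size budget.

    sensitivity: hypothetical lookup, sensitivity[block][bit] -> precomputed
    loss increase when quantizing that block to `bit` (e.g. from block-wise
    measurements on the calibration set).
    """
    n = len(params_per_block)

    def fitness(cfg):
        if model_size(cfg, params_per_block) > size_budget:
            return float("inf")                       # infeasible configuration
        return sum(sensitivity[i][b] for i, b in enumerate(cfg))

    pop = [tuple(random.choice(BITS) for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]                # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)              # single-point crossover
            child = list(a[:cut] + b[cut:])
            for i in range(n):                        # random mutation
                if random.random() < mutate_p:
                    child[i] = random.choice(BITS)
            children.append(tuple(child))
        pop = parents + children
    return min(pop, key=fitness)
```

Because the block-wise sensitivities are precomputed once, evaluating a candidate configuration only requires table lookups, which is what makes a population-based search affordable.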

4. EXPERIMENTS

Experimental overview:

  • In this section, we report experimental results for the ImageNet classification task and MS COCO object detection task.
  • The rest of this section contains an ablation study on reconstruction granularity, classification and detection results, mixed-precision results, and a comparison with quantization-aware training.
  • In Appendix B, more experiments are conducted, including the impact of the first and the last layer, and the impact of calibration dataset size and data source.

4.1 ABLATION STUDY

Table 1 shows that block-wise optimization outperforms the other granularities.

This result implies that the generalization error in net-wise and stage-wise optimization outweighs their off-diagonal loss. (Interpretation: although these coarser granularities capture more cross-layer interaction, i.e. off-diagonal terms, the over-fitting they incur on the small calibration set is larger, so they perform worse on unseen data.)

On ResNet-18, we find the difference is not significant; this can potentially be attributed to the fact that ResNet-18 has only 19 layers in its body, so the block size as well as the stage size is small, leading to indistinct results.

4.2 IMAGENET

Models evaluated:

  • We conduct experiments on a variety of modern deep learning architectures, including ResNet (He et al., 2016) with normal convolution, MobileNetV2 (Sandler et al., 2018) with depthwise separable convolution and RegNet (Radosavovic et al., 2020) with group convolution.
  • Last but not least, we also investigate a neural-architecture-searched (NAS) model, MNasNet (Tan et al., 2019).
  • We compare with strong baselines including Bias Correction, optimal MSE, AdaRound, AdaQuant, and Bit-split. Note that the first and the last layer are kept with 8-bit.

Setup (weight-only quantization):

  • Accuracy comparison on weight-only quantized post-training models: activations are unquantized and kept in full precision. We also conduct a variance study for our experiments.
  • In Table 2, we only quantize weights into low-bit integers and keep activations full precision.
  • Note that the first and the last layer are kept with 8-bit.

Results (weight-only):

  • While most of the existing methods have good performances in 4-bit quantization, they cannot successfully quantize the model into 2-bit.
  • Our method consistently achieves the lowest accuracy degradation for ResNets (within 5%) and other compact models.

Setup (fully quantized):

  • Accuracy comparison on fully quantized post-training models. Activations here are quantized to 4-bit.

Results (fully quantized):

  • We find that 4-bit activation quantization can have a large impact on RegNet and MobileNet.
  • Nonetheless, our method produces higher performance than other state-of-the-art methods.
  • Notably, BRECQ is the first to bring the 2W4A (2-bit weights, 4-bit activations) accuracy of PTQ to a usable level, while all other existing methods collapse.

4.3 COMPARISON WITH QUANTIZATION-AWARE TRAINING

Setup:

In this section, we compare our algorithm (post-training quantization) with some quantization-aware training methods, including PACT (Choi et al., 2018), DSQ (Gong et al., 2019), LSQ (Esser et al., 2020), and a mixed precision technique HAQ (Wang et al., 2019).

Results:

  • Table 4 shows that although BRECQ is a PTQ method with limited available data, it achieves accuracy comparable to existing quantization-aware training models.
  • In addition, our method surpasses them on 4-bit MobileNetV2 while using less than one GPU hour of training.
  • Our method also has accuracy comparable to HAQ, a training-based mixed-precision method. Note that our GPU hours include three unified-precision trainings (2-, 4-, and 8-bit respectively), and further mixed-precision deployment only needs to check the resulting lookup table. In contrast, HAQ searches end-to-end from scratch for each hardware performance threshold.

4.4 MS COCO

Setup:

To validate the effectiveness of BRECQ on other tasks, we conduct object detection on the two-stage Faster-RCNN (Ren et al., 2015) and the one-stage RetinaNet (Lin et al., 2017). ResNet-18, ResNet-50, and MobileNetV2 are adopted as backbones for the detection models.

Results:

  • Results in Table 5 demonstrate that our method causes almost no performance drop with 4-bit weight quantization and 8-bit activations.
  • In particular, BRECQ loses only 0.21% mAP on Faster-RCNN with a 4-bit ResNet-18 backbone.
  • On RetinaNet with a 4-bit ResNet-50 backbone, our method outperforms the mixed-precision-based ZeroQ model by 3% mAP.
  • Even when the weight bit-width decreases to 2, the model still achieves near-original mAP.

4.5 MIXED PRECISION

Setup:

  • We test (1) model-size-guaranteed mixed precision and (2) FPGA-latency-guaranteed mixed precision to unleash the potential of mixed precision and further push the limit of PTQ.
  • We choose ResNet-18, MobileNetV2, and RegNetX-600MF to validate the efficacy of our algorithm.
  • Note that in this section activations are kept at 8-bit, because we only compare the discrepancy between unified and mixed precision in the weights.
  • We omit 3-bit weight quantization in unified precision because it is usually unfriendly to hardware.

Results:

  • (1) Mixed precision consistently outperforms unified precision, especially at extremely low bit-widths, e.g., up to a 10% accuracy increase at the same latency as the 2-bit model.
  • (2) Mixed precision can produce many bit configurations that adapt to a wide range of hardware requirements, while unified precision yields only 2 fixed models.

6. CONCLUSION

In this paper, we propose BRECQ, a post-training quantization framework obtained by analyzing the second-order error.

  • We show that reconstructing the quantization error at block granularity achieves a good balance between cross-layer dependency and first-order approximation, especially in 2-bit weight quantization, where no prior work has succeeded.
  • BRECQ is compatible with mixed precision and can reduce the search cost.
  • To the best of our knowledge, BRECQ reaches the highest performance in post-training quantization and is the first to be on a par with quantization-aware training at 4-bit.
