BRECQ: PUSHING THE LIMIT OF POST-TRAINING QUANTIZATION BY BLOCK RECONSTRUCTION

1. INTRODUCTION

  1. Based on the second-order analysis, we define a set of reconstruction units and show that block reconstruction is the best choice, with support from theoretical and empirical evidence. We also use the Fisher Information Matrix to assign each pre-activation an importance measure during reconstruction.
  2. We incorporate a genetic algorithm and the well-defined intra-block sensitivity measure to generate latency- and size-guaranteed mixed-precision quantized neural networks, which yields consistent improvements on both specialized hardware (FPGA) and general hardware (ARM CPU).
  3. We conduct extensive experiments to verify the proposed methods and find that they are applicable to a large variety of tasks and models. Moreover, we show for the first time that post-training quantization can quantize weights to INT2 without significant accuracy loss.

2 PRELIMINARIES

3 METHOD

3.1 CROSS-LAYER DEPENDENCY

3.2 BLOCK RECONSTRUCTION

paragraph 1

Problem raised: Although network-output reconstruction gives an accurate estimate of the second-order error, we find in practice it is worse than layer-by-layer reconstruction in PTQ.

Reason: The primary reason is that optimizing the whole network over only 1024 calibration samples easily leads to over-fitting.

Further explanation, citing related work: As Jakubovitz et al. (2019) explained, networks can have perfect expressivity when the number of parameters exceeds the number of training samples, but lower training error does not ensure lower test error.

Analogy for layer-wise reconstruction, which suggests it may have its own limitation (not stated explicitly in the paper; inferred from the transition that follows): We find layer-wise reconstruction acts like a regularizer that reduces the generalization error by matching each layer's output distribution.

Proposed direction: an intermediate granularity is needed. In other words, both layer-wise and network-wise output reconstruction have their own drawbacks, and a better bias-variance trade-off should be obtained by conducting reconstruction at an intermediate granularity.

paragraph 2

First, the two existing granularities: layer-wise optimization corresponds to a layer-diagonal Hessian (Fig. 1b, blue parts) and network-wise optimization corresponds to the full Hessian (Fig. 1b, green parts).

Building on these two, the paper proposes its own granularity: similarly, we can define an intermediate block-diagonal Hessian.

Mathematical formula showing the block-diagonal Hessian:
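
A minimal sketch of the general form of such an approximation, assuming the network is partitioned into n blocks with parameters w^(1), ..., w^(n) (this notation is illustrative, not copied from the paper):

```latex
\mathbf{H} \;\approx\;
\begin{pmatrix}
\mathbf{H}^{(1)} & \mathbf{0}       & \cdots & \mathbf{0} \\
\mathbf{0}       & \mathbf{H}^{(2)} & \cdots & \mathbf{0} \\
\vdots           & \vdots           & \ddots & \vdots     \\
\mathbf{0}       & \mathbf{0}       & \cdots & \mathbf{H}^{(n)}
\end{pmatrix},
\qquad
\mathbf{H}^{(k)} \;=\; \frac{\partial^{2}\mathcal{L}}{\partial \mathbf{w}^{(k)}\,\partial \mathbf{w}^{(k)\top}}
```

Each diagonal block H^(k) keeps the curvature among parameters of the same network block, while every cross-block entry is set to zero.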

Explanation of the formula: such a block-diagonal Hessian ignores the inter-block dependency and keeps the intra-block dependency, but it produces less generalization error. We can then reconstruct the intermediate outputs block-by-block.

paragraph 3

We define two extra kinds of intermediate reconstruction granularity, giving four options in total:

  • Layer-wise Reconstruction: Assume the Hessian matrix is layer-diagonal and optimize the layer outputs one-by-one. It does not consider cross-layer dependency and resembles existing methods.
  • Block-wise Reconstruction: A block is the core component of modern CNNs, such as the Residual Bottleneck Block shown in Fig. 1a. This method assumes the Hessian matrix is block-diagonal and performs reconstruction block-by-block, which ignores inter-block dependencies.
  • Stage-wise Reconstruction: A stage is where the feature maps are downsampled and more channels are generated, which is believed to produce higher-level features. A typical CNN for the ImageNet dataset contains 4 or 5 different stages. This method simultaneously optimizes all layers within a stage and thus considers more dependencies than the block-wise method.
  • Network-wise Reconstruction: Optimize the whole quantized network by reconstructing the output of the final layer. This method resembles distillation but does not perform well with few images because of its high generalization error.

After testing the four reconstruction granularities, block-wise optimization works best. Conjectured reason: we think this is because the main off-diagonal loss in the Hessian is concentrated within each block, as the orange part of Fig. 1b illustrates, while the inter-block loss is small and can be ignored during optimization.

Supporting evidence: the shortcut connections proposed in (He et al., 2016) may also increase the dependencies within a block.

Drawbacks of the two coarser granularities: stage-wise and network-wise optimization suffer from poor generalization on the validation set, which degrades the final performance.

Finally, why block-wise optimization is chosen (the paper does not claim it is the optimal granularity, only that it has two merits): It is necessary to point out that our analysis does not give the optimal configuration of the reconstruction granularity. The choice of block-wise optimization comes from our experiments, and this choice has two merits: (1) no hyper-parameters are introduced, and (2) it is applicable to all models and tasks we tested.
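
To make the chosen granularity concrete, here is a minimal PyTorch-style sketch of block-by-block reconstruction on a small calibration set. The names `fp_blocks` and `q_blocks` and the plain MSE objective are simplifications for illustration (BRECQ additionally weights the objective by squared gradients, see Section 3.3); this is not the authors' implementation.

```python
import torch

def block_reconstruction(fp_blocks, q_blocks, calib_data, iters=20000, lr=1e-3):
    """Sketch of block-by-block output reconstruction.

    fp_blocks / q_blocks: lists of full-precision and quantized blocks
    (hypothetical objects; each quantized block is assumed to expose the
    quantization parameters to be optimized as trainable parameters).
    calib_data: a small calibration batch, e.g. 1024 images.
    """
    x_fp = calib_data                    # running full-precision activations
    x_q = calib_data                     # running quantized activations
    for fp_blk, q_blk in zip(fp_blocks, q_blocks):
        with torch.no_grad():
            y_fp = fp_blk(x_fp)          # target: the block's full-precision output
        params = [p for p in q_blk.parameters() if p.requires_grad]
        opt = torch.optim.Adam(params, lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            y_q = q_blk(x_q)
            # Plain MSE reconstruction of the block output.
            loss = torch.nn.functional.mse_loss(y_q, y_fp)
            loss.backward()
            opt.step()
        with torch.no_grad():
            x_fp, x_q = y_fp, q_blk(x_q)  # feed the outputs to the next block
    return q_blocks
```

Feeding each quantized block the output of the previous quantized block keeps the calibration consistent with how the quantized network will actually run at inference time.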

3.3 APPROXIMATING PRE-ACTIVATION HESSIAN

paragraph 1

Advantage of the block-diagonal approximated Hessian matrix: with it, we can measure the cross-layer dependency inside each block and transform any block's second-order error into the error of this block's output.
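
A hedged reconstruction of the kind of objective referred to here, writing Δz^(k) for the quantization-induced change in block k's output and H^(z^(k)) for the Hessian of the task loss with respect to that output (symbols assumed for illustration):

```latex
\min_{\widehat{\mathbf{w}}^{(k)}}\;
\mathbb{E}\!\left[\, \Delta \mathbf{z}^{(k)\top}\, \mathbf{H}^{(\mathbf{z}^{(k)})}\, \Delta \mathbf{z}^{(k)} \right]
```

The block's contribution to the second-order loss increase is expressed through its output error, but the pre-activation Hessian H^(z^(k)) still depends on the layers after block k, which is exactly the knowledge from the rest of the network that the next sentence refers to.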

Requirement of this objective: it still requires further computation involving knowledge from the rest of the network,

One possible way is to follow Nagel et al. (2020), but this method has a limitation: it might be easy to implement but loses too much information.

paragraph 2

The paper's own method: use the diagonal Fisher Information Matrix (FIM) to replace the pre-activation Hessian.

Why this is reasonable: the FIM is equal to the negative expected Hessian of the log-likelihood function; therefore, a simple corollary is that the Hessian of the task loss becomes the FIM if the model distribution matches the true data distribution. (The paper states the formula first and then interprets it to justify the approach.)

A limitation of the method, though it is the best that can be done here: although matching the true data distribution seems impossible, this is the best we can do since the pretrained model has converged.
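
For reference, the identity being invoked (a standard definition, not quoted from the paper), with p_θ the model's predictive distribution and the expectation over labels drawn from the model:

```latex
\mathbf{F}(\theta)
= \mathbb{E}_{y \sim p_{\theta}}\!\left[\nabla_{\theta}\log p_{\theta}(y \mid x)\,
                                        \nabla_{\theta}\log p_{\theta}(y \mid x)^{\top}\right]
= -\,\mathbb{E}_{y \sim p_{\theta}}\!\left[\nabla_{\theta}^{2}\log p_{\theta}(y \mid x)\right]
```

So when the model distribution matches the true data distribution, the expected Hessian of the negative log-likelihood (i.e. the task loss) coincides with the FIM.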

paragraph 3

Based on the fact that the diagonal of the pre-activation FIM equals the squared gradient of each element, a quantity successfully used in Adam, the modified optimization objective is proposed.

Comparison with plain MSE: compared with MSE minimization, the above minimization incorporates the squared-gradient information. If an output element has a higher absolute gradient, it receives more attention during reconstruction. A similar method for pruning pre-activations was proposed in Theis et al. (2018).
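
A minimal sketch of what such a squared-gradient-weighted reconstruction loss could look like; `grad_sq` is assumed to be pre-computed element-wise squared gradients of the task loss with respect to the full-precision block output on the calibration set (the names are illustrative, not from the official implementation):

```python
import torch

def fim_weighted_mse(y_q: torch.Tensor,
                     y_fp: torch.Tensor,
                     grad_sq: torch.Tensor) -> torch.Tensor:
    """Reconstruction loss weighted by the diagonal pre-activation FIM.

    y_q     : output of the quantized block
    y_fp    : output of the full-precision block (target)
    grad_sq : element-wise squared gradients of the task loss w.r.t. y_fp,
              pre-computed on the calibration set and treated as a constant
    """
    diff = y_q - y_fp.detach()
    # Elements with larger absolute task-loss gradients dominate the objective.
    return (grad_sq.detach() * diff.pow(2)).mean()
```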

paragraph 4

BRECQ is compatible with other optimization methods.

Two existing techniques are combined in the paper: adaptive rounding (Nagel et al., 2020) is adopted for the weights and learned step size (Esser et al., 2020) for the activation step size, because they are observed to generally perform better in PTQ.

Finally, the advantages are emphasized: only a small subset (1024 samples in the experiments) of the whole training dataset is needed to calibrate the quantized model, and a quantized ResNet-18 can be obtained within 20 minutes on a single GTX 1080TI GPU.
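
For reference, a minimal sketch of a learned-step-size activation quantizer in the spirit of Esser et al. (2020); the gradient scaling of the original LSQ and the adaptive-rounding part for weights (Nagel et al., 2020) are omitted, so this is an illustration rather than BRECQ's actual implementation:

```python
import torch
import torch.nn as nn

class LearnedStepQuantizer(nn.Module):
    """LSQ-style activation fake-quantizer with a learnable step size."""

    def __init__(self, n_bits: int = 4, init_step: float = 0.1):
        super().__init__()
        self.qmin, self.qmax = 0, 2 ** n_bits - 1          # unsigned range for activations
        self.step = nn.Parameter(torch.tensor(init_step))   # learnable step size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = torch.clamp(x / self.step, self.qmin, self.qmax)
        # Straight-through estimator: rounding is bypassed in the backward pass.
        q = q + (q.round() - q).detach()
        return q * self.step
```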

3.4 MIXED PRECISION

paragraph 1

To further push the limit of post-training quantization, we employ mixed precision techniques

Then the formula is explained.
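
Schematically, the mixed-precision problem can be written as follows, with c the per-layer (or per-block) bit-width vector, L a sensitivity-based loss measure, H a hardware measurement (model size or latency), and β the budget; these symbols are chosen here for illustration, not copied from the paper:

```latex
\min_{\mathbf{c}} \; \mathcal{L}(\mathbf{c})
\quad \text{s.t.} \quad \mathcal{H}(\mathbf{c}) \le \beta,
\qquad c_i \in \{2, 4, 8\}
```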

paragraph 2

First, how existing methods handle the training loss L: nearly all existing literature (Cai et al., 2020; Hubara et al., 2020; Dong et al., 2019) uses a layer-wise measurement. They all assume the sensitivities of different layers are independent and can be summed together; therefore, the mixed-precision problem becomes an integer programming problem.

Then, the paper's objection and its reason: we argue that the loss measurement should contain two parts, a diagonal loss and an off-diagonal loss. The first is the same as in previous works and measures the sensitivity of each layer independently, while the off-diagonal loss measures the cross-layer sensitivity.

But directly examining all permutations is theoretically problematic: we would have to consider 3^n possibilities, which is prohibitively expensive for the search algorithm.

So, the first attempt: reduce the off-diagonal loss to the block level, since, as mentioned, the Hessian can be approximated by a block-diagonal matrix.

However, this still leaves a large search space: for example, if a block has four layers, we have to consider 3^4 = 81 permutations for a single block.

Because, based on preliminary experiments, 4-bit and 8-bit quantization barely drop the final accuracy, only 2-bit permutations are taken into consideration, which drastically reduces the search space.

Final solution: a genetic algorithm (Guo et al., 2020) is used to search for the optimal bit-width configuration under a hardware performance threshold.
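
A toy sketch of how such a genetic search over per-block bit-widths might look, assuming a precomputed block-wise sensitivity table; the names `sensitivity`, `params_per_block`, and `size_budget` are hypothetical, and this is not the actual implementation of Guo et al. (2020) or BRECQ:

```python
import random

BITS = (2, 4, 8)  # candidate weight bit-widths per block

def model_size(cfg, params_per_block):
    """Model size in bits for a configuration (one bit-width per block)."""
    return sum(b * n for b, n in zip(cfg, params_per_block))

def evolve(sensitivity, params_per_block, size_budget,
           pop_size=50, generations=100, mutate_p=0.1):
    """Toy genetic search over per-block bit-widths under a size budget.

    sensitivity: hypothetical lookup, sensitivity[block][bit] -> precomputed
    loss increase when quantizing that block to `bit` (e.g. from block-wise
    measurements on the calibration set).
    """
    n = len(params_per_block)

    def fitness(cfg):
        if model_size(cfg, params_per_block) > size_budget:
            return float("inf")                       # infeasible configuration
        return sum(sensitivity[i][b] for i, b in enumerate(cfg))

    pop = [tuple(random.choice(BITS) for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]                # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)              # single-point crossover
            child = list(a[:cut] + b[cut:])
            for i in range(n):                        # random mutation
                if random.random() < mutate_p:
                    child[i] = random.choice(BITS)
            children.append(tuple(child))
        pop = parents + children
    return min(pop, key=fitness)
```

Because the block-wise sensitivities are precomputed once, evaluating a candidate configuration only requires table lookups, which is what makes a population-based search affordable.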

4. EXPERIMENTS

Experimental overview:

  • In this section, we report experimental results for the ImageNet classification task and MS COCO object detection task.
  • The rest of this section contains an ablation study on reconstruction granularity, classification and detection results, mixed-precision results, and a comparison with quantization-aware training.
  • In Appendix B, more experiments are conducted, including the impact of the first and the last layer, and the impact of calibration dataset size and data source.

4.1 ABLATION STUDY

Table 1 shows that block-wise optimization outperforms the other granularities.

This result implies that the generalization error in net-wise and stage-wise optimization outweighs their off-diagonal loss. (Interpretation: although these coarser granularities capture more cross-layer interaction, i.e. off-diagonal terms, the over-fitting they incur on the small calibration set is larger, so they perform worse on unseen data.)

On ResNet-18, we find the difference is not significant; this can potentially be attributed to the fact that ResNet-18 has only 19 layers in its body, so the block size as well as the stage size is small, leading to indistinct results.

4.2 IMAGENET

Models evaluated:

  • We conduct experiments on a variety of modern deep learning architectures, including ResNet (He et al., 2016) with normal convolution, MobileNetV2 (Sandler et al., 2018) with depthwise separable convolution and RegNet (Radosavovic et al., 2020) with group convolution.
  • Last but not least, we also investigate a neural-architecture-searched (NAS) model, MNasNet (Tan et al., 2019).
  • We compare with strong baselines including Bias Correction, optimal MSE, AdaRound, AdaQuant, and Bit-split. Note that the first and the last layer are kept with 8-bit.

Setup (weight-only quantization):

  • Accuracy comparison on weight-only quantized post-training models: activations are unquantized and kept in full precision. We also conduct a variance study for our experiments.
  • In Table 2, we only quantize weights into low-bit integers and keep activations full precision.
  • Note that the first and the last layer are kept with 8-bit.

Results (weight-only):

  • While most of the existing methods have good performances in 4-bit quantization, they cannot successfully quantize the model into 2-bit.
  • Our method consistently achieves the lowest accuracy degradation for ResNets (within 5%) and other compact models.

Setup (fully quantized):

  • Accuracy comparison on fully quantized post-training models. Activations here are quantized to 4-bit.

Results (fully quantized):

  • We find that 4-bit activation quantization can have a large impact on RegNet and MobileNet.
  • Nonetheless, our method produces higher performance than other state-of-the-art methods.
  • Notably, BRECQ is the first to bring the 2W4A (2-bit weights, 4-bit activations) accuracy of PTQ to a usable level, while all other existing methods collapse.

4.3 COMPARISON WITH QUANTIZATION-AWARE TRAINING

Setup:

In this section, we compare our algorithm (post-training quantization) with some quantization-aware training methods, including PACT (Choi et al., 2018), DSQ (Gong et al., 2019), LSQ (Esser et al., 2020), and a mixed precision technique HAQ (Wang et al., 2019).

Results:

  • Table 4 shows that although BRECQ is a PTQ method with limited available data, it achieves accuracy comparable to existing quantization-aware training models.
  • In addition, our method surpasses them on 4-bit MobileNetV2 while using less than one GPU hour of training.
  • Our method also has accuracy comparable to HAQ, a training-based mixed-precision method. Note that our GPU hours include three unified-precision trainings (2-, 4-, and 8-bit respectively), and further mixed-precision deployment only needs to check the resulting lookup table. In contrast, HAQ searches end-to-end from scratch for each hardware performance threshold.

4.4 MS COCO

Setup:

To validate the effectiveness of BRECQ on other tasks, we conduct object detection on the two-stage Faster-RCNN (Ren et al., 2015) and the one-stage RetinaNet (Lin et al., 2017). ResNet-18, ResNet-50, and MobileNetV2 are adopted as backbones for the detection models.

Results:

  • Results in Table 5 demonstrate that our method causes almost no performance drop with 4-bit weight quantization and 8-bit activations.
  • In particular, BRECQ loses only 0.21% mAP on Faster-RCNN with a 4-bit ResNet-18 backbone.
  • On RetinaNet with a 4-bit ResNet-50 backbone, our method outperforms the mixed-precision-based ZeroQ model by 3% mAP.
  • Even when the weight bit-width decreases to 2, the model still achieves near-original mAP.

4.5 MIXED PRECISION

Setup:

  • We test (1) model-size-guaranteed mixed precision and (2) FPGA-latency-guaranteed mixed precision to unleash the potential of mixed precision and further push the limit of PTQ.
  • We choose ResNet-18, MobileNetV2, and RegNetX-600MF to validate the efficacy of our algorithm.
  • Note that in this section activations are kept at 8-bit, because we only compare the discrepancy between unified and mixed precision in the weights.
  • We omit 3-bit weight quantization in unified precision because it is usually unfriendly to hardware.

Results:

  • (1) Mixed precision consistently outperforms unified precision, especially at extremely low bit-widths, e.g., up to a 10% accuracy increase at the same latency as the 2-bit model.
  • (2) Mixed precision can produce many bit configurations that adapt to a wide range of hardware requirements, while unified precision yields only 2 fixed models.

6. CONCLUSION

In this paper, we propose BRECQ, a post-training quantization framework obtained by analyzing the second-order error.

  • We show that reconstructing the quantization error at block granularity achieves a good balance between cross-layer dependency and first-order approximation, especially in 2-bit weight quantization, where no prior work has succeeded.
  • BRECQ is compatible with mixed precision and can reduce the search cost.
  • To the best of our knowledge, BRECQ reaches the highest performance in post-training quantization and is the first to be on a par with quantization-aware training at 4-bit.
