Bug log: Exception: Current loss scale already at minimum

Updated: 2024-10-22 05:03:27


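My experiment is built on huggingface/diffusers and uses pytorch_lightning with AMP and DeepSpeed ZeRO stage 2. The first run did not use the scale learning rate strategy and ran normally without errors, but after enabling scale_lr the run failed with: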

Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
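To make the message concrete, here is a minimal sketch of dynamic loss scaling under fp16. It is a simplified toy, not DeepSpeed's actual class: the name `ToyDynamicLossScaler`, the helper `steps_until_failure`, and the default values are assumptions for illustration, and details such as hysteresis are left out. The idea it shows is that the scale is cut after every step whose gradients overflow, so hitting the minimum means the gradients kept overflowing no matter how small the scale became.

```python
class ToyDynamicLossScaler:
    """Simplified illustration of dynamic loss scaling (hypothetical, not DeepSpeed code)."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0,
                 scale_window=1000, min_scale=1.0):
        self.cur_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update_scale(self, overflow):
        if overflow:
            # Nothing left to reduce: this is the situation the error reports.
            if self.cur_scale <= self.min_scale:
                raise RuntimeError(
                    "Current loss scale already at minimum - "
                    "cannot decrease scale anymore."
                )
            # Shrink the scale and restart the count of good steps.
            self.cur_scale = max(self.cur_scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
        else:
            # After enough clean steps, try a larger scale again.
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                self.cur_scale *= self.scale_factor


def steps_until_failure(scaler, max_steps=100):
    """Feed the scaler nothing but overflowing steps and report when it gives up."""
    for step in range(1, max_steps + 1):
        try:
            scaler.update_scale(overflow=True)
        except RuntimeError:
            return step
    return None


if __name__ == "__main__":
    print(steps_until_failure(ToyDynamicLossScaler()))
```

If every step overflows, the toy scaler runs out of room after a handful of steps. In a real run that pattern usually points to gradients that are genuinely inf/NaN rather than merely too large for fp16, which would be consistent with the larger learning rate that scale_lr produces.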

After a lot of searching online, the useful information I found is:

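The main idea is to skip this exception; the code it can be traced to:
The file to modify:
deepspeed.runtime.fp16.loss_scaler
The error may be caused by certain bad batches. Someone suggested an approach along these lines, but unfortunately it does not support FP16:
Others suggested switching from fp16 to bf16 (not tested).
Watch how the gradients change: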

from lightning.pytorch.utilities import grad_norm

def on_before_optimizer_step(self, optimizer):
    # Compute the 2-norm for each layer
    # If using mixed precision, the gradients are already unscaled here
    norms = grad_norm(self.layer, norm_type=2)
    self.log_dict(norms)

Check whether the gradients explode; if they do, handle it with gradient clipping.
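Two of the ideas above, gradient clipping and switching from fp16 to bf16, can both be expressed in a DeepSpeed config. The sketch below only builds the config dictionary. The helper name `make_deepspeed_config` is hypothetical, the numeric values are illustrative rather than tuned, and the dictionary is partial (batch-size settings and the optimizer are omitted). With Lightning, such a dictionary could presumably be passed as `DeepSpeedStrategy(config=...)`, with the Trainer precision set to match; that wiring is not tested here.

```python
import json


def make_deepspeed_config(precision="fp16", clip_norm=1.0):
    """Build a partial DeepSpeed config dict (hypothetical helper, illustrative values)."""
    config = {
        "zero_optimization": {"stage": 2},
        # Clip the global gradient norm to limit exploding gradients.
        "gradient_clipping": clip_norm,
    }
    if precision == "bf16":
        # bf16 has the same exponent range as fp32, so no loss scaling is involved.
        config["bf16"] = {"enabled": True}
    elif precision == "fp16":
        config["fp16"] = {
            "enabled": True,
            "loss_scale": 0,            # 0 selects dynamic loss scaling
            "initial_scale_power": 16,  # initial scale = 2 ** 16
            "loss_scale_window": 1000,  # clean steps before the scale is raised
            "hysteresis": 2,
            "min_loss_scale": 1,
        }
    else:
        raise ValueError(f"unsupported precision: {precision}")
    return config


if __name__ == "__main__":
    print(json.dumps(make_deepspeed_config("bf16"), indent=2))
```

Clipping limits the size of a large but finite gradient norm. It cannot repair gradients that are already inf/NaN, so it is worth logging the norms first, as in the snippet above, to see which case applies.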


Published: 2024-03-23 14:42:50

Article link: https://www.elefans.com/category/jswz/34/1739388.html