
Recently, while modifying a network that contains an InplaceABN module, I ran into the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 256, 7, 7]], which is output 0 of InPlaceABNBackward, is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

When I first used InplaceABN I had never read the paper or the code, so solving this problem took several hours of blind trial and error. I knew that consecutive in-place operations were the cause, but I could not pinpoint which code in which block was triggering it, and kept trying clone() in the wrong places. Only the next day, after browsing the GitHub issues, did I finally understand the real cause.

1. The blocks provided by InplaceABN

ABN is standard BN + activation (no memory savings).
InPlaceABN is BN+activation done inplace (with memory savings).
InPlaceABNSync is BN+activation done inplace (with memory savings) + computation of BN (fwd+bwd) with data from all the gpus.

2. Inplace shortcut

Change out += residual to out = out + residual.

Both += and add_() are in-place operations.

The problem in my case was that the ResidualBlock contained two consecutive in-place operations: InplaceABN followed by add_.
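The failure can be reproduced in plain PyTorch without InplaceABN at all. A minimal sketch, using torch.exp as a stand-in for any op that saves its own output for the backward pass (which is exactly what InPlaceABN does): the subsequent in-place += bumps the tensor's version counter, and autograd detects the mismatch at backward time.

```python
import torch

x = torch.ones(3, requires_grad=True)
out = x.exp()        # exp() saves its output tensor for the backward pass
out += 1             # in-place add bumps the output's version counter

try:
    out.sum().backward()
    failed = False
except RuntimeError:
    # "one of the variables needed for gradient computation has been
    # modified by an inplace operation ..."
    failed = True

print(failed)        # True: autograd detects the version mismatch
```

This is the same situation as InplaceABN followed by add_ in the ResidualBlock: two in-place modifications of a tensor whose earlier value is still needed for the gradient.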

3. Solution

In the ResidualBlock, replace the in-place shortcut out += residual with the out-of-place form out = out + residual (or clone() the InplaceABN output before the in-place add), so that the activation's saved output is never modified a second time.

 
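A minimal sketch of the fix, again using torch.exp as a stand-in for the in-place activation: the out-of-place add allocates a new tensor, so the output saved for backward is left untouched and the gradient computes cleanly.

```python
import torch

x = torch.ones(3, requires_grad=True)
residual = torch.full((3,), 0.5)

out = x.exp()             # stand-in for an op whose output is saved for backward
out = out + residual      # out-of-place add: allocates a new tensor,
                          # leaving the saved output unmodified
out.sum().backward()      # backward now succeeds

print(x.grad)             # d/dx (exp(x) + residual) = exp(x)
```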

References:

https://github.com/mapillary/inplace_abn/issues/6

inplace_abn/resnet.py at main · mapillary/inplace_abn · GitHub

inplace_abn/residual.py at main · mapillary/inplace_abn · GitHub
