Problem description
I read somewhere that if you choose a batch size that is a power of 2, training will be faster. What is this rule? Does it apply to other applications? Can you provide a reference paper?
Solution
Algorithmically speaking, using larger mini-batches reduces the variance of your stochastic gradient updates (by taking the average of the gradients in the mini-batch), and this in turn allows you to take bigger step sizes, which means the optimization algorithm will make progress faster.
However, the amount of work done (in terms of the number of gradient computations) to reach a certain accuracy in the objective will be the same: with a mini-batch size of n, the variance of the update direction is reduced by a factor of n, so the theory allows you to take step sizes that are n times larger, and a single step will take you roughly to the same accuracy as n steps of SGD with a mini-batch size of 1.
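The factor-of-n variance reduction above is easy to check empirically. A minimal NumPy sketch (the per-example gradients, noise level, and batch size 64 are illustrative assumptions, not from the original answer): we model each per-example gradient as the true gradient plus unit-variance noise, and compare the variance of single-example updates against mini-batch averages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "per-example gradients": true gradient 1.0 plus unit-variance noise.
true_grad, noise_std, trials = 1.0, 1.0, 10_000

def minibatch_grad(n):
    """Average of n noisy per-example gradients: one mini-batch update direction."""
    samples = true_grad + noise_std * rng.standard_normal((trials, n))
    return samples.mean(axis=1)

var_1 = minibatch_grad(1).var()    # variance of SGD with batch size 1
var_64 = minibatch_grad(64).var()  # variance with batch size 64

# The ratio is close to the batch size: variance shrinks by a factor of n.
print(var_1 / var_64)
```

The printed ratio comes out near 64, matching the claim that a mini-batch of size n cuts the update variance by a factor of n.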
As for TensorFlow, I found no evidence supporting this claim, and it is a question that has been closed on GitHub: https://github.com/tensorflow/tensorflow/issues/4132
Note that resizing images to a power of two makes sense (because pooling is generally done in 2x2 windows), but that is a different thing altogether.
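To see why power-of-two image sizes interact nicely with 2x2 pooling, here is a small NumPy sketch (the 64- and 48-pixel side lengths and the reshape-based pooling helper are illustrative assumptions): a power-of-two side length can be halved by pooling all the way down without padding, while other sizes eventually hit an odd dimension.

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max pooling on a square feature map whose side length is even."""
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "2x2 pooling needs even dimensions"
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A power-of-two side length survives repeated 2x2 pooling without padding:
x = np.arange(64 * 64, dtype=float).reshape(64, 64)
sizes = [x.shape[0]]
while x.shape[0] > 1 and x.shape[0] % 2 == 0:
    x = maxpool2x2(x)
    sizes.append(x.shape[0])
print(sizes)  # 64 halves cleanly down to 1

# By contrast, a side of 48 stops dividing evenly: 48 -> 24 -> 12 -> 6 -> 3 (odd).
```

This is purely about spatial dimensions of the input, not about the batch size, which is the point of the remark above.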