The error message may look like this: unhandled cuda error, NCCL version 2.4.8
Set the following environment variables to see NCCL's debug log:
export NCCL_SOCKET_IFNAME=enp6s0
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=INFO
Note: the enp6s0 in export NCCL_SOCKET_IFNAME=enp6s0 is the name of your local network interface. Use ifconfig to find yours.
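The interface lookup described above can be sketched in shell. This is a minimal sketch rather than the author's method: it assumes a Linux machine where /sys/class/net lists the interfaces, and the helper name pick_iface is made up for illustration. It takes the first non-loopback name, which may be a virtual interface (a bridge, for example), so check the result against ifconfig before relying on it.

```shell
#!/usr/bin/env bash
# pick_iface (hypothetical helper): print the first interface name
# read from stdin that is not the loopback device "lo".
pick_iface() {
    grep -v '^lo$' | head -n 1
}

# Assumption: on Linux, /sys/class/net holds one entry per interface.
iface="$(ls /sys/class/net 2>/dev/null | pick_iface || true)"

if [ -n "$iface" ]; then
    export NCCL_SOCKET_IFNAME="$iface"
    export NCCL_IB_DISABLE=1
    export NCCL_DEBUG=INFO
    echo "NCCL_SOCKET_IFNAME=$iface"
else
    echo "no non-loopback interface found; set NCCL_SOCKET_IFNAME by hand" >&2
fi
```

Source the script (for example `. ./nccl_env.sh`, a hypothetical file name) instead of executing it, so the exported variables stay in the shell that launches training.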
When the CUDA versions do not match, the log looks like this:
znsoft-virtual-machine:102553:102553 [0] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.113<0>
NCCL version 2.4.8+cuda10.2
znsoft-virtual-machine:102620:102620 [1] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
znsoft-virtual-machine:102620:102695 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Channel 00 : 0 1
znsoft-virtual-machine:102620:102695 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Using 256 threads, Min Comp Cap 8, Trees disabled
znsoft-virtual-machine:102620:102695 [1] NCCL INFO comm 0x7f0438002580 rank 1 nranks 2 cudaDev 1 nvmlDev 1 - Init COMPLETE
znsoft-virtual-machine:102553:102694 [0] NCCL INFO comm 0x7fbb600025a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
znsoft-virtual-machine:102620:102620 [1] enqueue:197 NCCL WARN Cuda failure 'invalid device function'
znsoft-virtual-machine:102620:102620 [1] NCCL INFO misc/group:148 -> 1
znsoft-virtual-machine:102553:102553 [0] NCCL INFO Launch mode Parallel
znsoft-virtual-machine:102553:102553 [0] enqueue:197 NCCL WARN Cuda failure 'invalid device function'
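Note the last line: enqueue:197 NCCL WARN Cuda failure 'invalid device function'
This happens when the CUDA version PyTorch was built with differs from the CUDA version installed on the machine. In the log above, NCCL 2.4.8 was built for CUDA 10.2, while the GPUs report compute capability 8 (Min Comp Cap 8), which CUDA 10.2 cannot target.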
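To pull the two relevant facts out of a long debug log, a small shell sketch like the following may help. The function names and the file name train.log are assumptions for illustration. The patterns are taken from the log lines shown above and may need adjusting for other NCCL versions.

```shell
#!/usr/bin/env bash
# Print the CUDA version NCCL was built with, taken from a line such as
# "NCCL version 2.4.8+cuda10.2" read on stdin.
nccl_build_cuda() {
    sed -n 's/^NCCL version [0-9.]*+cuda\([0-9.]*\).*/\1/p' | head -n 1
}

# Succeed when the log on stdin contains the failure discussed above.
has_invalid_device_function() {
    grep -q "NCCL WARN Cuda failure 'invalid device function'"
}

# train.log is a hypothetical file holding the NCCL_DEBUG=INFO output.
if [ -f train.log ]; then
    if has_invalid_device_function < train.log; then
        echo "invalid device function; NCCL built for CUDA $(nccl_build_cuda < train.log)"
    fi
fi
```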
Note that the NCCL package must be installed. I built it with the following commands:
git clone https://github.com/NVIDIA/nccl.git
cd nccl
export NVCC_GENCODE=-gencode=arch=compute_80,code=compute_80
make CUDA_HOME=/usr/local/cuda
make install
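Solution:
Install a PyTorch build whose CUDA version matches the one on the machine.
The CUDA version reported by nvidia-smi (the highest version the installed driver supports) should match the version PyTorch was built for. Mine shows: CUDA Version: 11.7
So PyTorch should be installed with a nearby CUDA version such as 11.6 or 11.7.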
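After reinstalling, the match can be checked with a short shell sketch. It assumes nvidia-smi and python3 are on the PATH and that PyTorch is importable. The function names are made up for illustration, and comparing only the major version is a simplification of the "nearby version" advice above, not an official compatibility rule.

```shell
#!/usr/bin/env bash
# Extract "11.7" from nvidia-smi header text ("CUDA Version: 11.7") on stdin.
driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed when two versions such as 11.6 and 11.7 share the same major number.
same_cuda_major() {
    [ -n "$1" ] && [ "${1%%.*}" = "${2%%.*}" ]
}

# Only query the machine when both tools are present.
if command -v nvidia-smi >/dev/null 2>&1 && command -v python3 >/dev/null 2>&1; then
    driver="$(nvidia-smi | driver_cuda_version || true)"
    # torch.version.cuda is the CUDA version this PyTorch build was compiled with.
    built="$(python3 -c 'import torch; print(torch.version.cuda)' 2>/dev/null || true)"
    if same_cuda_major "$driver" "$built"; then
        echo "OK: driver CUDA $driver, PyTorch built for CUDA $built"
    else
        echo "Mismatch: driver CUDA ${driver:-unknown}, PyTorch built for CUDA ${built:-unknown}"
    fi
fi
```

A CPU-only PyTorch build reports None for torch.version.cuda, which this sketch treats as a mismatch.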
Copyright notice: this article, "Fixing NCCL WARN Cuda failure 'invalid device function', unhandled cuda error, NCCL version 2.4.8", was contributed by a community user and reflects only the author's views. To reprint, contact the author and cite the source: https://www.elefans.com/dongtai/1728784567a1173161.html