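Environment:
OS: Ubuntu Server 20.04
The cluster has 8 machines in total. The management node and nodes 3-8 all work normally, but node 2 fails no matter how many times I retry: running sinfo on it prints slurm_load_partitions: Unable to contact slurm controller (connect failure)
slurmd service status: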
root@node02:/etc# systemctl status slurmd
● slurmd.service
Loaded: loaded (/etc/init.d/slurmd; generated)
Active: active (running) since Thu 2022-07-07 09:36:21 UTC; 21min ago
Docs: man:systemd-sysv-generator(8)
Process: 2460 ExecStart=/etc/init.d/slurmd start (code=exited, status=0/SUCCESS)
Tasks: 2 (limit: 629145)
Memory: 10.9M
CGroup: /system.slice/slurmd.service
└─1929 /etc/init.d/slurmd start
Jul 07 09:36:21 node02 systemd[1]: Starting slurmd.service...
Jul 07 09:36:21 node02 systemd[1]: Started slurmd.service.
root@node02:/etc#
slurm.conf configuration file:
NodeName=node01 NodeAddr=10.0.0.1 CPUs=6 State=UNKNOWN
NodeName=node02 NodeAddr=10.0.0.2 CPUs=96 State=UNKNOWN
NodeName=node03 NodeAddr=10.0.0.3 CPUs=96 State=UNKNOWN
NodeName=node04 NodeAddr=10.0.0.4 CPUs=96 State=UNKNOWN
NodeName=node05 NodeAddr=10.0.0.5 CPUs=96 State=UNKNOWN
NodeName=node06 NodeAddr=10.0.0.6 CPUs=96 State=UNKNOWN
NodeName=node07 NodeAddr=10.0.0.7 CPUs=96 State=UNKNOWN
NodeName=node08 NodeAddr=10.0.0.8 CPUs=96 State=UNKNOWN
PartitionName=control Nodes=node01 Default=NO MaxTime=INFINITE State=UP
PartitionName=compute Nodes=node[02-08] Default=YES MaxTime=INFINITE State=UP
#PartitionName=debug Nodes=master,io01,node02 Default=YES MaxTime=INFINITE State=UP
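When one node cannot reach the controller, it can help to compare the addresses declared in slurm.conf with what that node's resolver actually returns. The following is a minimal sketch, not part of the original post: `list_node_addrs` is a hypothetical helper name, and it assumes each node is declared on its own line with the exact key spellings `NodeName=` and `NodeAddr=`, as in the excerpt above (bracketed ranges such as node[02-08] are not expanded).

```shell
#!/usr/bin/env bash
# list_node_addrs <slurm.conf>   (hypothetical helper, for illustration only)
# Print "name address" for every uncommented line that carries both a
# NodeName= and a NodeAddr= field. Assumes one node per line.
list_node_addrs() {
    awk '
        /^[[:space:]]*#/ { next }            # skip commented-out lines
        {
            name = ""; addr = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^NodeName=/) name = substr($i, 10)   # text after "NodeName="
                if ($i ~ /^NodeAddr=/) addr = substr($i, 10)   # text after "NodeAddr="
            }
            if (name != "" && addr != "") print name, addr
        }
    ' "$1"
}
```

On the affected node, the output of `list_node_addrs /etc/slurm/slurm.conf` (the path is an assumption and varies by installation) can be compared by hand with `getent hosts node01`, `getent hosts node02`, and so on. A name that resolves to a loopback address instead of its 10.0.0.x address points to a problem in /etc/hosts.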
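All other nodes work normally. Jobs that exclude node02 run fine and the job script behaves as expected. As soon as node02 is used, the job produces no output, and after a forced scancel the node stays in the comp state and the job stays in CG. Could someone experienced help me figure this out?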
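A node stuck in the completing state with jobs stuck in CG is commonly cleared by marking the node DOWN and then resuming it with `scontrol update`. The sketch below only addresses that symptom, not its cause. `reset_node` and the `DRY_RUN` variable are hypothetical names introduced here, and by default the helper only prints the commands so they can be reviewed before being run on the controller.

```shell
#!/usr/bin/env bash
# reset_node <node> [reason]   (hypothetical helper, for illustration only)
# Prints the two scontrol commands by default; set DRY_RUN=0 to execute them.
reset_node() {
    local node="$1" reason="${2:-stuck in completing state}"
    local run=echo                           # dry run: echo instead of executing
    if [ "${DRY_RUN:-1}" = "0" ]; then
        run=""                               # real run: execute scontrol directly
    fi
    # Slurm requires a Reason when a node is set DOWN.
    $run scontrol update NodeName="$node" State=DOWN Reason="$reason"
    $run scontrol update NodeName="$node" State=RESUME
}
```

For example, `reset_node node02` prints the two commands, and `DRY_RUN=0 reset_node node02` runs them. If the node falls back into the same state afterwards, the underlying communication problem between the node and the controller still needs to be found.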
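Update: I found the problem while troubleshooting the next day. Node02 had long had an issue where the munge configuration file was deleted automatically after every reboot, so I reinstalled the operating system. After the reinstall, I forgot to change the 127.0.1.1 node01 line in /etc/hosts. Changing it to 127.0.1.1 node02 solved the problem.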
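The fix above can be turned into a quick check to run after reinstalling or cloning a node. One plausible reading, which the post itself does not state, is that the stale line made node02 resolve the name node01 to its own loopback address, so that anything addressed to node01 never left the machine. The sketch below uses a hypothetical helper, `check_loopback_name`, which reports whether the 127.0.1.1 entry in a hosts file lists the expected hostname.

```shell
#!/usr/bin/env bash
# check_loopback_name <hosts_file> <expected_hostname>
# (hypothetical helper, for illustration only)
# Returns 0 if there is no 127.0.1.1 entry or if it lists the expected name,
# and 1 if the entry exists but names a different host.
check_loopback_name() {
    local hosts_file="$1" expected="$2"
    awk -v want="$expected" '
        { sub(/#.*/, "") }                   # drop comments before matching
        $1 == "127.0.1.1" {
            seen = 1
            entry = $0
            for (i = 2; i <= NF; i++) if ($i == want) found = 1
        }
        END {
            if (!seen) { print "no 127.0.1.1 entry"; exit 0 }
            if (found) { print "ok: 127.0.1.1 maps to " want; exit 0 }
            print "mismatch: found \"" entry "\", expected " want
            exit 1
        }
    ' "$hosts_file"
}
```

On node02, `check_loopback_name /etc/hosts "$(hostname -s)"` would have reported a mismatch while the line still read 127.0.1.1 node01.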
Copyright notice: this post, titled "Help request: slurm_load_partitions: Unable to contact slurm controller (connect failure)", was contributed by a community member and reflects only the author's views. For reprints, contact the author and cite the source: https://www.elefans.com/dianzi/1726640386a1079580.html