Setting up SLURM on Ubuntu
Install munge (required on every node)
This walkthrough uses two machines on the same LAN. On each machine, add the IP and hostname of both to /etc/hosts:
IP Master
IP Slave1
Once that is done, check that the two machines can ping each other
and that ssh can connect. If ssh is not installed, run
sudo apt install ssh
In /etc/ssh/sshd_config, set PermitRootLogin to yes, then restart the ssh service so the change takes effect
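Before moving on, it can help to confirm that both hostnames really have an entry in the hosts file. The following is a minimal sketch, not part of the original setup: `check_hosts` is a hypothetical helper name, and `HOSTS_FILE` is an assumed override that defaults to /etc/hosts.

```shell
#!/bin/sh
# Sketch: report which of the given hostnames have an entry in a hosts file.
# HOSTS_FILE and check_hosts are illustrative names, not standard tools.
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"

check_hosts() {
    rc=0
    for name in "$@"; do
        # Strip comments, then look for the name in any field after the IP.
        if awk -v n="$name" '
            { sub(/#.*/, ""); for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$HOSTS_FILE"; then
            echo "$name: ok"
        else
            echo "$name: missing"
            rc=1
        fi
    done
    return $rc
}
```

On a configured machine, `check_hosts Master Slave1 && ping -c 1 Slave1` would run the ping only when both names are present.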
sudo su
apt upgrade
apt install munge
Installing through apt picks up the stable version, which is simpler than a manual install and less error-prone.
Create the key on Master
create-munge-key
scp /etc/munge/munge.key root@Slave1:/etc/munge/
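# Slave1 is resolved through /etc/hosts; update the matching IP there if needed
Fix the permissions (required on every node)
# Change directory ownership
chown -R munge:munge /var/{lib,log,run}/munge
chown -R munge:munge /etc/munge
# Change directory modes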
chmod 711 /var/lib/munge
chmod 700 /var/log/munge
chmod 755 /var/run/munge
chmod 700 /etc/munge
chmod 400 /etc/munge/munge.key
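To confirm that the modes above actually took effect, something like the following sketch could be used. `check_mode` is a hypothetical helper, and it assumes GNU `stat` (the default on Ubuntu).

```shell
#!/bin/sh
# check_mode PATH EXPECTED: compare a path's octal mode with the expected value.
# Hypothetical helper; relies on GNU stat's -c '%a' format.
check_mode() {
    actual=$(stat -c '%a' "$1") || return 2
    if [ "$actual" = "$2" ]; then
        echo "$1: $actual ok"
    else
        echo "$1: $actual, expected $2"
        return 1
    fi
}

# Example calls on a node where munge is installed:
#   check_mode /etc/munge 700
#   check_mode /etc/munge/munge.key 400
```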
Use id to check the uid and gid of the munge user on each node. If they differ between nodes, change them so they match (with usermod and groupmod)
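The comparison can be scripted roughly as follows. This is a sketch under assumptions: `ids_of` and `compare_munge_ids` are hypothetical helper names, and root ssh access to the other node is assumed, as configured earlier.

```shell
#!/bin/sh
# ids_of USER: read passwd-format lines on stdin, print "uid:gid" for USER.
ids_of() {
    awk -F: -v u="$1" '$1 == u { print $3 ":" $4 }'
}

# compare_munge_ids NODE: compare the local munge uid:gid with NODE's.
compare_munge_ids() {
    local_ids=$(getent passwd munge | ids_of munge)
    remote_ids=$(ssh "root@$1" getent passwd munge | ids_of munge)
    if [ "$local_ids" = "$remote_ids" ]; then
        echo "munge ids match: $local_ids"
    else
        echo "mismatch: local $local_ids, $1 $remote_ids"
        return 1
    fi
}

# If they differ, align one node with the other (NEW_UID and NEW_GID are placeholders):
#   usermod -u NEW_UID munge
#   groupmod -g NEW_GID munge
# and then repeat the chown commands, since existing files keep the old numeric owner.
```

Running `compare_munge_ids Slave1` on Master prints either the shared ids or both values side by side.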
Restart munge
systemctl restart munge
systemctl status munge
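# Check that it started successfully
- Note: the permissions must be fixed first, otherwise munge will not start. It is also best to start the management node before the other nodes
Finally, check that munge credentials work locally and across nodes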
munge -n | unmunge
munge -n | ssh Slave1 unmunge
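With more than one compute node, the cross-node check can be wrapped in a loop. A minimal sketch follows; `check_munge_nodes` is a hypothetical helper name, and it only repeats the same `munge -n | ssh NODE unmunge` pipeline for each argument.

```shell
#!/bin/sh
# check_munge_nodes NODE...: encode a credential locally and decode it on each node.
# Hypothetical helper; the pipeline's exit status is that of the remote unmunge.
check_munge_nodes() {
    rc=0
    for node in "$@"; do
        if munge -n | ssh "$node" unmunge >/dev/null 2>&1; then
            echo "$node: credential accepted"
        else
            echo "$node: FAILED"
            rc=1
        fi
    done
    return $rc
}

# Example: check_munge_nodes Slave1
```

A failure here usually points back to a key that differs between nodes, wrong permissions, or clocks that are too far apart.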
Install slurm
Install slurm on both nodes
apt install slurm-wlm slurm-wlm-doc -y
slurmd -C
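slurmd -V
# Check that the installation works: -C prints the hardware configuration detected on this node, -V prints the version
Configure the conf file
vim /etc/slurm-llnl/slurm.conf
# slurm.conf file generated by configurator easy.html.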
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=Master
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
SlurmctldTimeout=3600
SlurmdTimeout=300
BatchStartTimeout=3600
PropagateResourceLimits=NONE
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
# Acct
AccountingStorageEnforce=1
AccountingStorageLoc=/opt/slurm/acct
AccountingStorageType=accounting_storage/filetxt
JobCompLoc=/opt/slurm/jobcomp
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#
# COMPUTE NODES
# List every node here
NodeName=Master CPUs=4 State=UNKNOWN
NodeName=Slave1 CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=Master,Slave1 Default=YES MaxTime=INFINITE State=UP
The file must be identical on all nodes. After changing it, restart the services on each node
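Since a mismatched slurm.conf is an easy mistake to make, the copies can be compared by checksum. The sketch below assumes root ssh access and the /etc/slurm-llnl/slurm.conf path used in this post; `conf_sum` and `compare_conf` are hypothetical helper names.

```shell
#!/bin/sh
# conf_sum FILE: print only the SHA-256 digest of FILE (sha256sum is part of coreutils).
conf_sum() {
    sha256sum "$1" | cut -d' ' -f1
}

# compare_conf NODE...: compare each node's slurm.conf with the local copy.
compare_conf() {
    conf=/etc/slurm-llnl/slurm.conf
    want=$(conf_sum "$conf")
    rc=0
    for node in "$@"; do
        got=$(ssh "root@$node" sha256sum "$conf" | cut -d' ' -f1)
        if [ "$got" = "$want" ]; then
            echo "$node: identical"
        else
            echo "$node: DIFFERS"
            rc=1
        fi
    done
    return $rc
}

# Example, run on Master: compare_conf Slave1
```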
Fix the file permissions (each of the paths below needs it)
rm -rf /var/spool/slurm-llnl
mkdir /var/spool/slurm-llnl
chown -R slurm:slurm /var/spool/slurm-llnl
rm -rf /var/run/slurm-llnl/
mkdir /var/run/slurm-llnl/
chown -R slurm:slurm /var/run/slurm-llnl/
sudo mkdir -p /opt/slurm
sudo chmod -Rf 777 /opt/slurm
cd /opt/slurm
touch acct
touch jobcomp
Start the services (slurmd on all nodes; slurmctld is only needed on the control node, Master)
systemctl start slurmd
systemctl enable slurmd
systemctl start slurmctld
systemctl enable slurmctld
Use sinfo to view the node information
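When a node shows up as down or drained, a small filter over the sinfo output makes it easier to spot. This is a sketch: `unhealthy_nodes` is a hypothetical helper name, and the commented commands assume the two-node cluster configured above.

```shell
#!/bin/sh
# unhealthy_nodes: read "NODE STATE" lines on stdin and print the nodes
# whose state starts with "down" or "drain" (e.g. down*, drained, draining).
unhealthy_nodes() {
    awk '$2 ~ /^(down|drain)/ { print $1 }'
}

# On a live cluster:
#   sinfo -h -N -o '%N %T' | unhealthy_nodes          # one "NODE STATE" line per node
#   scontrol update NodeName=Slave1 State=RESUME      # return a down node to service
#   srun -N2 hostname                                 # should print both hostnames
```

If `srun -N2 hostname` prints the names of both Master and Slave1, jobs are being dispatched to both nodes.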