Torque PBS troubleshooting
1.1
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating jobtest/jobtest from mn01)
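The likely cause: when many jobs are submitted concurrently, Torque by default submits some of them as a proxy user from other submit hosts or from compute nodes. Allowing proxy users and node-side submission in the server settings lets it cope with concurrent submissions.
On the Torque server, as root:
# Set the submit hosts
qmgr -c 'set server submit_hosts = mn01'
# Allow compute nodes to submit jobs
qmgr -c 'set server allow_node_submit = True'
# Allow proxy users
qmgr -c 'set server allow_proxy_user = True'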
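To confirm that the three attributes above actually took effect, the output of `qmgr -c 'print server'` can be scanned for them. This is a minimal sketch: the function name `check_submit_settings` is made up for illustration, and it assumes the usual `set server <attribute> = <value>` line format that `print server` emits.

```shell
#!/usr/bin/env bash
# Sketch: report whether the server attributes from section 1.1 are set.
# Reads `qmgr -c 'print server'` output on stdin.
# check_submit_settings is a hypothetical helper name, not a Torque command.
check_submit_settings() {
    local config attr missing=0
    config=$(cat)
    for attr in submit_hosts allow_node_submit allow_proxy_user; do
        # Multi-valued attributes may appear as "+=" on later lines.
        if printf '%s\n' "$config" | grep -Eq "^set server ${attr} \+?= "; then
            echo "ok: ${attr}"
        else
            echo "missing: ${attr}"
            missing=1
        fi
    done
    return "$missing"
}
```

Typical use on the server, as root: `qmgr -c 'print server' | check_submit_settings`. A non-zero exit status means at least one attribute is not set.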
1.2
qsub: submit error (Bad UID for job execution MSG=User lsh does not exist in server password file)
On the Torque server, as root:
qmgr -c 'set server allow_node_submit = True'
qmgr -c 'set server submit_hosts = mn01'
Also create a user on the server node with the same name as on the client, and set up passwordless login in both directions.
1.3
LOG_ERROR::Unable to get connection to socket (15096) in tcp_connect_sockaddr, Failed when trying to get privileged port - socket_get_tcp_priv() failed
systemctl stop pbs_mom pbs_server trqauthd pbs_sched
#systemctl status -l pbs_mom pbs_server trqauthd pbs_sched
systemctl restart pbs_mom && systemctl status pbs_mom
systemctl restart pbs_server && systemctl status pbs_server
systemctl restart trqauthd && systemctl status trqauthd
#systemctl restart pbs_sched && systemctl status pbs_sched
systemctl restart maui.d && systemctl status maui.d
systemctl status -l pbs_mom pbs_server trqauthd maui.d
tail -f /var/spool/torque/job_logs/20180511
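The restart sequence above can be wrapped in a small function so that it stops at the first service that fails to restart. This is only a sketch: the `run` helper, the `DRY_RUN` variable, and the function name are illustrative conventions, and the service names (including `maui.d`) are taken from the commands above.

```shell
#!/usr/bin/env bash
# Sketch: restart the Torque services in the same order as above.
# DRY_RUN=1 prints the commands instead of executing them (hypothetical convention).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_torque_stack() {
    local svc
    run systemctl stop pbs_mom pbs_server trqauthd pbs_sched
    for svc in pbs_mom pbs_server trqauthd maui.d; do
        # Abort on the first service that does not come back.
        run systemctl restart "$svc" || { echo "restart failed: $svc" >&2; return 1; }
        run systemctl status --no-pager "$svc"
    done
}
```

Running `DRY_RUN=1 restart_torque_stack` first shows the exact commands before anything is restarted.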
1.4
5月 09 16:29:21 c01n01 PBS_Server[18995]: LOG_WARNING::Bad UID for job execution (15025) in log_commit_error, send_job commit failed, rc=15025 (Bad UID for job execution: start failed on unknown node)
5月 09 16:29:21 c01n01 PBS_Server[18995]: LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 12.c01n01
5月 09 16:29:21 c01n01 PBS_Server[18995]: LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 12.c01n01
5月 09 16:29:21 c01n01 PBS_Server[18995]: LOG_ERROR::Request invalid for state of job (15018) in 12.c01n01, obit received for job 12.c01n01 from host c01n05 with bad state (state: QUEUED)
5月 09 16:29:21 c01n01 pbs_server[18995]: Assertion failed, bad pointer in link: file "req_select.c", line 401
On the Torque server, as root:
tracejob <job.id>
05/09/2018 16:31:27 A queue=batch
05/09/2018 16:31:28.912 S unable to run job, MOM rejected/timeout
05/09/2018 16:31:28.913 S unable to run job, send to MOM '172.18.1.5' failed
ssh root@172.18.1.5
groupadd -g 1004 jobtest
useradd -u 1004 -g 1004 jobtest
passwd jobtest
su - jobtest
ssh-******
[jobtest@c01n05 ~]$ ssh-copy-id mn01
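The fix above recreates the job owner on the MOM node with the same UID and GID (1004) as on the server. A quick way to check whether an account matches on two hosts is to compare fields 3 and 4 of its passwd entry. The sketch below assumes passwd-format lines such as those returned by `getent passwd`; the helper names `uid_gid` and `same_account` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare the UID:GID of one account as seen on two hosts.
# uid_gid and same_account are hypothetical helper names.

# Print "uid:gid" (fields 3 and 4) of a passwd-format line.
uid_gid() {
    printf '%s\n' "$1" | cut -d: -f3,4
}

# $1, $2: passwd-format lines for the same user from two hosts.
# Succeeds only if both lines are non-empty and their UID:GID pairs match.
same_account() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$(uid_gid "$1")" = "$(uid_gid "$2")" ]
}
```

For example, on the server: `same_account "$(getent passwd jobtest)" "$(ssh root@172.18.1.5 getent passwd jobtest)" || echo "jobtest differs or is missing on 172.18.1.5"`.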
1.5 The server cannot find the MOM node
May 09 13:58:46 server02.localdomain systemd[1]: Started TORQUE pbs_mom daemon.
May 09 13:58:46 server02.localdomain systemd[1]: Starting TORQUE pbs_mom daemon...
May 09 13:58:47 server02.localdomain pbs_mom[12621]: LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
May 09 13:58:47 server02.localdomain pbs_mom[12621]: LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
May 09 13:58:51 server02.localdomain pbs_mom[12621]: LOG_ERROR::send_update_to_a_server, Status update successfully sent after 1 MOM status update intervals
May 09 13:58:46 server02.localdomain systemd[1]: Started TORQUE pbs_server daemon.
May 09 13:58:46 server02.localdomain systemd[1]: Starting TORQUE pbs_server daemon...
May 09 13:58:51 server02.localdomain PBS_Server[12638]: LOG_ERROR::svr_is_request, bad attempt to connect from 192.168.0.82:519 (address not trusted - check entry in server_priv/nodes)
Solution: as the last log line suggests, check the entry for the MOM node in the server's nodes file.
vim /var/spool/torque/server_priv/nodes
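The nodes file lists one node per line, with the hostname first and optional attributes such as `np=<cores>` after it. Before editing, it can help to check whether the MOM host is listed at all. The sketch below assumes that layout; `node_listed` is a hypothetical helper name, and the default path is the one used above.

```shell
#!/usr/bin/env bash
# Sketch: succeed if HOST appears as the first field of a line in the nodes file.
# node_listed is a hypothetical helper name; usage: node_listed HOST [NODES_FILE]
node_listed() {
    local host=$1
    local file=${2:-/var/spool/torque/server_priv/nodes}
    # Commented-out lines ("# host ...") do not match because field 1 is "#".
    awk -v h="$host" '$1 == h { found = 1 } END { exit !found }' "$file"
}
```

For example: `node_listed server02 || echo "server02 is not in server_priv/nodes"`. The name in the file also has to resolve to the address the MOM connects from (192.168.0.82 in the log above).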
1.6 Environment conflict: set the PATH environment variable
export PATH=/usr/local/sbin:/usr/local/bin/:$PATH
1.7
pbs_server[18995]: Assertion failed, bad pointer in link: file "req_select.c", line 401
1.8
PBS_Server[25497]: LOG_ERROR::Client connection not found. trqauthd unable to authorize user. Possible transient failure. Please try again (15135) in req_authenuser, trqauthd fail 49436
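Restart the trqauthd service on the client, and try restarting the network (in multi-NIC environments the problem may be related to NIC priority and startup order).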
systemctl restart trqauthd && systemctl status trqauthd
1.9
6月 05 09:11:57 c01n01 PBS_Server[25414]: LOG_ERROR::Permission denied (13) in chk_file_sec, Security violation with "/var/spool/torque/spool/" - /var/spool/torque/spool/ cannot be accessed
6月 05 09:11:57 c01n01 PBS_Server[25414]: LOG_ERROR::PBS_Server, pbsd_init failed
chmod -Rf 755 /var
chmod -Rf 777 /var/spool/torque/spool/
chmod +t /var/spool/torque/spool/
chmod -Rf 777 /var/spool/torque/undelivered/
chmod +t /var/spool/torque/undelivered/
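After `chmod -Rf 777` followed by `chmod +t`, the two directories themselves should end up with mode 1777. The sketch below checks that with GNU `stat`; the function name `check_spool_modes` is made up for illustration, and the default base path is the one from the log above.

```shell
#!/usr/bin/env bash
# Sketch: verify that spool/ and undelivered/ have mode 1777 (GNU stat required).
# check_spool_modes is a hypothetical helper name; usage: check_spool_modes [BASE_DIR]
check_spool_modes() {
    local base=${1:-/var/spool/torque}
    local d mode rc=0
    for d in "$base/spool" "$base/undelivered"; do
        # %a prints the octal permission bits, including the sticky bit.
        mode=$(stat -c '%a' "$d" 2>/dev/null) || mode=missing
        if [ "$mode" = 1777 ]; then
            echo "ok: $d"
        else
            echo "bad ($mode): $d"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `check_spool_modes` before restarting pbs_server gives a quick yes/no on whether the permission fix was applied.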
1.10 PBS_MOM errors
[root@c03n08 ~]# systemctl status -l pbs_mom trqauthd
● pbs_mom.service - TORQUE pbs_mom daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_mom.service; enabled; vendor preset: disabled)
Active: failed (Result: core-dump) since Wed 2018-08-08 13:34:32 CST; 2s ago
Process: 19159 ExecStop=/bin/bash -c for i in {1..5}; do kill -0 $MAINPID &>/dev/null || exit 0; /usr/local/sbin/momctl -s && exit; sleep 1; done (code=exited, status=0/SUCCESS)
Process: 19155 ExecStart=/usr/local/sbin/pbs_mom -F -d $PBS_HOME $PBS_ARGS (code=dumped, signal=SEGV)
Main PID: 19155 (code=dumped, signal=SEGV)
Aug 08 13:34:30 c03n08 systemd[1]: Started TORQUE pbs_mom daemon.
Aug 08 13:34:30 c03n08 systemd[1]: Starting TORQUE pbs_mom daemon...
Aug 08 13:34:31 c03n08 pbs_mom[19155]: LOG_ERROR::No such file or directory (2) in task_recov, open of task file
Aug 08 13:34:32 c03n08 pbs_mom[19155]: LOG_ERROR::init_abort_jobs, job 404.c01n01 no longer has valid password entry - deleting
Aug 08 13:34:32 c03n08 systemd[1]: pbs_mom.service: main process exited, code=dumped, status=11/SEGV
Aug 08 13:34:32 c03n08 systemd[1]: Unit pbs_mom.service entered failed state.
Aug 08 13:34:32 c03n08 systemd[1]: pbs_mom.service failed.
Reference:
pbs - Torque : pbs_Server No such file or directory (2) in recov_attr, read2 - Stack Overflow
Move the stale job file named in the log out of the MOM's jobs directory, then restart pbs_mom:
mkdir -p ~/backups/venv/var/spool/torque/mom_priv/jobs/
mv /var/spool/torque/mom_priv/jobs/404.c01n01.JB ~/backups/venv/var/spool/torque/mom_priv/jobs/
systemctl restart pbs_mom
systemctl status -l pbs_mom
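The same move can be parameterised by job id so it is reusable for other stale jobs. This is a sketch under assumptions: `quarantine_job` is a hypothetical helper name, and it moves every entry in the jobs directory whose name starts with the job id followed by a dot, which goes beyond the single `.JB` file moved above. The default paths are the ones used in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: move a job's files out of the MOM jobs directory into a backup directory.
# quarantine_job is a hypothetical helper name.
# usage: quarantine_job JOB_ID [JOBS_DIR] [BACKUP_DIR]
quarantine_job() {
    local job=$1
    local jobs_dir=${2:-/var/spool/torque/mom_priv/jobs}
    local backup_dir=${3:-$HOME/backups/venv/var/spool/torque/mom_priv/jobs}
    local f moved=0
    mkdir -p "$backup_dir" || return 1
    # Assumption: all entries for a job are named "<job id>.<suffix>".
    for f in "$jobs_dir/$job".*; do
        [ -e "$f" ] || continue   # glob matched nothing
        mv -- "$f" "$backup_dir/" && moved=$((moved + 1))
    done
    echo "moved $moved file(s) for job $job"
    [ "$moved" -gt 0 ]
}
```

For the case above: `quarantine_job 404.c01n01 && systemctl restart pbs_mom`.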
1.11 ghost_queue error when submitting a job
qsub: submit error (This queue had errors during its recovery. Please correct any settings that were lost on restart and then unset the ghost_queue setting via qmgr. Once this is unset, then the queue will be able to accept new jobs again.)
Checking the queue on the Torque server node shows ghost_queue = True. Earlier problems with the queue's nodes (a reboot, for example) triggered Torque's protection mechanism. Once the nodes in the queue have recovered, set ghost_queue to False. References:
Queue Attribute Reference
Automatic Queue and Job Recovery
qmgr -c 'list queue beijing'
qmgr -c 'set queue beijing ghost_queue = 0'
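Checking for the flag can be scripted by scanning the output of `qmgr -c 'list queue <name>'`. This is a minimal sketch: `is_ghost_queue` is a hypothetical helper name, and it assumes the usual indented `attribute = value` lines in that output.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the queue listing read from stdin has ghost_queue = True.
# is_ghost_queue is a hypothetical helper name.
# Reads `qmgr -c 'list queue NAME'` output on stdin.
is_ghost_queue() {
    grep -Eiq '^[[:space:]]*ghost_queue[[:space:]]*=[[:space:]]*true[[:space:]]*$'
}
```

For example: `qmgr -c 'list queue beijing' | is_ghost_queue && echo "beijing is still marked as a ghost queue"`.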
1.12
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; enabled; vendor preset: disabled)
Active: failed (Result: signal) since Sat 2018-08-25 04:38:09 CST; 2 days ago
Process: 13194 ExecStart=/usr/local/sbin/pbs_server -F -d $PBS_HOME $PBS_ARGS (code=killed, signal=ABRT)
Main PID: 13194 (code=killed, signal=ABRT)
Aug 25 04:38:08 c01n01 pbs_server[13194]: 7f8dda17c000-7f8dda17d000 r--p 00021000 fd:00 33699 /usr/lib64/ld-2.17.so
Aug 25 04:38:08 c01n01 pbs_server[13194]: 7f8dda17d000-7f8dda17e000 rw-p 00022000 fd:00 33699 /usr/lib64/ld-2.17.so
Aug 25 04:38:08 c01n01 pbs_server[13194]: 7f8dda17e000-7f8dda17f000 rw-p 00000000 00:00 0
Aug 25 04:38:08 c01n01 pbs_server[13194]: 7fff9d866000-7fff9d977000 rw-p 00000000 00:00 0 [stack]
Aug 25 04:38:08 c01n01 pbs_server[13194]: 7fff9d9aa000-7fff9d9ac000 r-xp 00000000 00:00 0 [vdso]
Aug 25 04:38:08 c01n01 pbs_server[13194]: ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aug 25 04:38:08 c01n01 pbs_server[13194]: pbs_server is up (version - 6.1.2, port - 15001)
Aug 25 04:38:09 c01n01 systemd[1]: pbs_server.service: main process exited, code=killed, status=6/ABRT
Aug 25 04:38:09 c01n01 systemd[1]: Unit pbs_server.service entered failed state.
Aug 25 04:38:09 c01n01 systemd[1]: pbs_server.service failed.
1.13 Common Reasons Why Jobs Won't Start
Common Reasons Why Jobs Won't Start
1.13.1 could not locate requested resources xxx (node_spec failed) job allocation request exceeds currently available cluster nodes
[torquedev] strange behavior in TORQUE 5.1.1
[torqueusers] pbsnodes still show node state=free with all np assigned