torch.distributed.elastic: PyTorch training is unstable

Updated: 2024-10-15 08:21:52

Error log:

Epoch: [229] Total time: 0:17:21
Test:   [ 0/49]  eta: 0:05:00  loss: 1.7994 (1.7994)  acc1: 78.0822 (78.0822)  acc5: 95.2055 (95.2055)  time: 6.1368  data: 5.9411  max mem: 10624
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44348 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44349 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44350 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44351 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44352 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44353 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44354 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/biometrics/miniconda3/envs/torch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0.dev20220502', 'console_scripts', 'torchrun')())
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/biometrics/miniconda3/envs/torch/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 44343 got signal: 1
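
In the log above, "got signal: 1" corresponds to SIGHUP on Linux, the signal a process typically receives when its controlling terminal closes (for example, when an SSH session drops). The sketch below is not from the original post. It assumes a POSIX system, and the names describe_signal, launch_detached, train.py, and train.log are hypothetical, chosen only for illustration. It decodes the signal number and shows one possible way to start a launcher in its own session, so that a hangup of the terminal is not delivered to it.

```python
import signal
import subprocess


def describe_signal(signum: int) -> str:
    """Map a numeric signal (as printed in the log) to its symbolic name."""
    return signal.Signals(signum).name


def launch_detached(cmd, log_path):
    """Start cmd in a new session (POSIX only), appending its output to log_path.

    start_new_session=True makes the child call setsid(), so it has no
    controlling terminal and is not sent SIGHUP when the terminal that
    started it goes away.
    """
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )
    finally:
        # The child keeps its own copy of the file descriptor.
        log.close()
    return proc


# Hypothetical usage (script and log names are assumptions):
# launch_detached(["torchrun", "--nproc_per_node=1", "train.py"], "train.log")
```

Running the job under nohup or inside a terminal multiplexer such as tmux or screen is another commonly used way to get a similar effect. Whether a closed terminal was the actual cause in this case cannot be confirmed from the log alone.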

The solution found online is:

Published: 2024-02-06 11:20:33
Article link: https://www.elefans.com/category/jswz/34/1748370.html