tianshou + OpenAI Gym Reinforcement Learning on Atari Game Environments (with complete code)




Contents

  • Environment setup
    • Installing tianshou + PyTorch
    • Installing gym + Atari environments
    • Other notes
      • NOTE 1: env.render() raises an error
      • NOTE 2: Windows installation issue: 'module could not be found' when running gym
    • References
  • RL implementation for Atari games with RAM-type input
    • Walking through the official Deep Q-learning example
    • Modifying the Deep Q-learning example
    • Testing the trained model

Environment setup

Installing tianshou + PyTorch

1. First, install the tianshou library:

pip install tianshou

2. Since tianshou is built on PyTorch, you also need to install a PyTorch build that matches your machine (pick the right CUDA version on the PyTorch website), for example:

pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
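A quick sanity check (just a sketch) that the build you installed actually matches your machine and can see the GPU:

# Verify the installed PyTorch version and CUDA availability
import torch

print(torch.__version__)          # e.g. 1.9.0+cu111
print(torch.cuda.is_available())  # True if this build can use your GPU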

Installing gym + Atari environments

1. First, install gym:

pip install gym

2. Import the Atari ROMs package.
Download the ROM archive and save it to your download directory first.

Extract the archive in the download directory.

Then run:

python3 -m atari_py.import_roms ./

Verify the installation with the following code:

import gym
import time
env = gym.make('MsPacman-ram-v0')  # create the game environment
env.seed(1)  # optional: seed the RNG so the run is reproducible
s = env.reset()  # reset the environment and get the initial state
while True:  # one iteration per step
    time.sleep(0.05)
    env.render()  # display the environment
    a = env.action_space.sample()  # the agent picks a random action
    s_, r, done, info = env.step(a)  # next state, reward, done flag and extra info after taking action a
    # print("s_", s_)
    print("r", r)
    # print("done", done)
    # print("info", info)
    if done:
        break

Other notes:

NOTE 1: env.render() raises an error

 If you are running the script over SSH or another remote command line (i.e. without a display), remember to comment out:
    env.render()  # display the environment
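An alternative to deleting the line is to guard it behind a flag. A small sketch using a hypothetical HEADLESS environment variable, which is not part of the original code:

import os
import gym

HEADLESS = os.environ.get("HEADLESS", "0") == "1"  # set HEADLESS=1 when running over SSH

env = gym.make('MsPacman-ram-v0')
env.reset()
while True:
    if not HEADLESS:
        env.render()  # only render when a display is available
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        break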

NOTE 2: Windows installation issue: 'module could not be found' when running gym

  The installation steps on Windows are similar, but you may run into missing C++ library errors. The following steps resolve them:

1. Uninstall gym and atari-py (if already installed):

pip uninstall atari-py
pip uninstall gym[atari]

2. Download the Visual Studio Build Tools: /?sku=BuildTools&rel=16

3. Run the VS Build Tools installer, select "C++ build tools", and install it.

4. Restart your computer.

5. Install cmake, atari-py, and gym:

pip install cmake
pip install atari-py
pip install gym[atari]

6. Test with the following code:

import atari_py
print(atari_py.list_games())

References:

Fix for the Windows error 'module could not be found' when running gym.make for an Atari environment
openai/atari-py installation reference
openai/atari environment
gym rendering example code
PyTorch official site (to find the PyTorch build that fits your machine)

RL implementation for Atari games with RAM-type input

To start, a quick plug for tianshou, the reinforcement learning library used here:

tianshou GitHub website
tianshou Tutorial

Walking through the official Deep Q-learning example

We can first walk through the Deep Q-learning implementation in the tianshou Tutorial:

  1. Create the environment
import gym
import tianshou as ts
env = gym.make('CartPole-v0')
  2. Set up vectorized environment wrappers
    tianshou supports parallel sampling for all of its algorithms. It provides four kinds of vectorized environment wrappers: DummyVectorEnv, SubprocVectorEnv, ShmemVectorEnv, and RayVectorEnv. They can be used as follows (more detail can be found in Parallel Sampling):
train_envs = gym.make('CartPole-v0')
test_envs = gym.make('CartPole-v0')

train_envs = ts.env.DummyVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(10)])
test_envs = ts.env.DummyVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(100)])

NOTE:
If you use your own environment, make sure its seed method is set up correctly, e.g.

def seed(self, seed):
    np.random.seed(seed)

Otherwise, the outputs of the parallel environments may be identical to each other.
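For reference, here is a minimal sketch of what such a custom environment could look like; the environment itself (MyEnv and its dynamics) is made up purely for illustration, only the seed handling matters:

import gym
import numpy as np
from gym import spaces

class MyEnv(gym.Env):
    """Toy custom environment; only the seed handling is the point here."""

    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.rng = np.random.RandomState()

    def seed(self, seed=None):
        # seed both numpy's global state and a per-env generator,
        # so parallel copies do not produce identical rollouts
        np.random.seed(seed)
        self.rng = np.random.RandomState(seed)
        return [seed]

    def reset(self):
        self.state = self.rng.uniform(-1, 1, size=4).astype(np.float32)
        return self.state

    def step(self, action):
        self.state = self.rng.uniform(-1, 1, size=4).astype(np.float32)
        reward = float(action)          # dummy reward
        done = self.rng.rand() < 0.05   # dummy termination
        return self.state, reward, done, {}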
  3. Build the network
import torch, numpy as np
from torch import nn

class Net(nn.Module):
    def __init__(self, state_shape, action_shape):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(np.prod(state_shape), 128), nn.ReLU(inplace=True),
            nn.Linear(128, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 128), nn.ReLU(inplace=True),
            nn.Linear(128, np.prod(action_shape)),
        )

    def forward(self, obs, state=None, info={}):
        if not isinstance(obs, torch.Tensor):
            obs = torch.tensor(obs, dtype=torch.float)
        batch = obs.shape[0]
        logits = self.model(obs.view(batch, -1))
        return logits, state

state_shape = env.observation_space.shape or env.observation_space.n
action_shape = env.action_space.shape or env.action_space.n
net = Net(state_shape, action_shape)
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
  4. Instantiate the policy
policy = ts.policy.DQNPolicy(net, optim, discount_factor=0.9, estimation_step=3, target_update_freq=320)
  5. Set up collectors
    The Collector is a key concept in tianshou. It lets the policy interact conveniently with different kinds of environments. At each step, the collector has the policy perform (at least) a given number of steps or episodes and stores the resulting data in a replay buffer. (A short hands-on sketch follows the code below.)
train_collector = ts.data.Collector(policy, train_envs, ts.data.VectorReplayBuffer(20000, 10), exploration_noise=True)
test_collector = ts.data.Collector(policy, test_envs, exploration_noise=True)
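Before wiring this into the trainer, it can help to drive the collector by hand once to see what it produces. A minimal sketch, assuming the 0.4-era tianshou API used in this post (collect() returns a stats dict and buffer.sample() returns a batch plus indices); the step counts are arbitrary:

# let the current (still untrained) policy act for 2000 environment steps
stats = train_collector.collect(n_step=2000)
print(stats["n/st"], stats["n/ep"])   # steps and full episodes just collected
print(len(train_collector.buffer))    # transitions now stored in the replay buffer

# sample a training batch from the buffer, just as the trainer would
batch, indices = train_collector.buffer.sample(64)
print(batch.obs.shape, batch.rew.shape)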
  6. Train the policy with a trainer
    The trainer automatically stops training once the policy reaches the stop condition stop_fn on the test collector. Since DQN is an off-policy algorithm, we use offpolicy_trainer():
result = ts.trainer.offpolicy_trainer(
    policy, train_collector, test_collector,
    max_epoch=10, step_per_epoch=10000, step_per_collect=10,
    update_per_step=0.1, episode_per_test=100, batch_size=64,
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),
    test_fn=lambda epoch, env_step: policy.set_eps(0.05),
    stop_fn=lambda mean_rewards: mean_rewards >= env.spec.reward_threshold)
print(f'Finished training! Use {result["duration"]}')

(See the hyper-parameter tutorial for what each of these arguments means.)

  7. Save / load weights
torch.save(policy.state_dict(), 'dqn.pth')
policy.load_state_dict(torch.load('dqn.pth'))
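The snippet above only stores the network weights. If you also plan to resume training later, one common pattern (not something the tutorial itself shows; the checkpoint file name here is made up) is to save the optimizer state alongside the policy:

# save policy weights together with the optimizer state (hypothetical file name)
torch.save({'model': policy.state_dict(), 'optim': optim.state_dict()}, 'dqn_checkpoint.pth')

# restore both before continuing training
checkpoint = torch.load('dqn_checkpoint.pth')
policy.load_state_dict(checkpoint['model'])
optim.load_state_dict(checkpoint['optim'])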
  8. Watch the agent's performance
policy.eval()
policy.set_eps(0.05)
collector = ts.data.Collector(policy, env, exploration_noise=True)
collector.collect(n_episode=1, render=1 / 35)
  9. Write the training loop yourself
# pre-collect at least 5000 transitions with random action before training
train_collector.collect(n_step=5000, random=True)

policy.set_eps(0.1)
for i in range(int(1e6)):  # total step
    collect_result = train_collector.collect(n_step=10)
    # once if the collected episodes' mean returns reach the threshold,
    # or every 1000 steps, we test it on test_collector
    if collect_result['rews'].mean() >= env.spec.reward_threshold or i % 1000 == 0:
        policy.set_eps(0.05)
        result = test_collector.collect(n_episode=100)
        if result['rews'].mean() >= env.spec.reward_threshold:
            print(f'Finished training! Test mean returns: {result["rews"].mean()}')
            break
        else:
            # back to training eps
            policy.set_eps(0.1)
    # train policy with a sampled batch data from buffer
    losses = policy.update(64, train_collector.buffer)
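If you write the loop yourself like this, you also lose the trainer's built-in TensorBoard logging. A stripped-down sketch of adding it back (the log directory and tag names are my own; the testing/stop logic from the loop above is omitted for brevity):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('log/custom_dqn')  # hypothetical log directory

# same structure as the loop above, with the update step extended to log each loss
for i in range(int(1e6)):
    train_collector.collect(n_step=10)
    losses = policy.update(64, train_collector.buffer)  # dict of scalar loss values
    for name, value in losses.items():
        writer.add_scalar(f'train/{name}', value, global_step=i)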

Modifying the Deep Q-learning example

Example path: tianshou/test/discrete/test_dqn.py

import os
import gym
import torch
import pickle
import pprint
import argparse
import numpy as np
from torch.utils.tensorboard import SummaryWriter

from tianshou.policy import DQNPolicy
from tianshou.utils import BasicLogger
from tianshou.env import DummyVectorEnv
from tianshou.utils.net.common import Net
from tianshou.trainer import offpolicy_trainer
from tianshou.data import Collector, VectorReplayBuffer, PrioritizedVectorReplayBuffer

'''
max_epoch: the maximum number of training epochs; training may stop earlier once stop_fn is satisfied
step_per_epoch: how many policy-network updates to run per epoch
collect_per_step: how many frames of interaction to collect before each update (with the settings below, the network is updated once every 10 collected frames)
episode_per_test: how many rollouts to run for each evaluation
batch_size: how much data the policy processes per batch
train_fn: a hook called before training in each epoch; it receives the current epoch number and the total number of env steps so far (below: set epsilon to 0.1 before training)
test_fn: a hook called before testing in each epoch; it receives the current epoch number and the total number of env steps so far (below: set epsilon to 0.05 before testing)
stop_fn: the stop condition; it receives the current average undiscounted return and returns whether training should stop
writer: tianshou supports TensorBoard; the writer can be initialized as shown below
'''
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--task', type=str, default='MsPacman-ram-v0')
    parser.add_argument('--seed', type=int, default=1626)
    parser.add_argument('--eps-test', type=float, default=0.05)
    parser.add_argument('--eps-train', type=float, default=0.1)
    parser.add_argument('--buffer-size', type=int, default=20000)
    parser.add_argument('--lr', type=float, default=1e-3)
    parser.add_argument('--gamma', type=float, default=0.9)
    parser.add_argument('--n-step', type=int, default=3)
    parser.add_argument('--target-update-freq', type=int, default=320)
    parser.add_argument('--epoch', type=int, default=20)
    parser.add_argument('--step-per-epoch', type=int, default=20000)
    parser.add_argument('--step-per-collect', type=int, default=10)
    parser.add_argument('--update-per-step', type=float, default=0.1)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--hidden-sizes', type=int,
                        nargs='*', default=[512, 256, 128, 128])
    parser.add_argument('--training-num', type=int, default=10)
    parser.add_argument('--test-num', type=int, default=100)
    parser.add_argument('--logdir', type=str, default='log')
    parser.add_argument('--render', type=float, default=0.)  # adjust the rendering FPS here
    parser.add_argument('--prioritized-replay',
                        action="store_true", default=False)
    parser.add_argument('--alpha', type=float, default=0.6)
    parser.add_argument('--beta', type=float, default=0.4)
    parser.add_argument('--save-buffer-name', type=str,
                        default="./expert_DQN_MsPacman-ram-v0.pkl")
    parser.add_argument('--device', type=str,
                        default='cuda' if torch.cuda.is_available() else 'cpu')
    args = parser.parse_known_args()[0]
    return args


def test_dqn(args=get_args()):
    env = gym.make(args.task)
    args.state_shape = env.observation_space.shape or env.observation_space.n
    args.action_shape = env.action_space.shape or env.action_space.n
    print("args.state_shape {} ,args.action_shape {} ".format(args.state_shape, args.action_shape))
    # train_envs = gym.make(args.task)
    # you can also use tianshou.env.SubprocVectorEnv
    train_envs = DummyVectorEnv(
        [lambda: gym.make(args.task) for _ in range(args.training_num)])
    # test_envs = gym.make(args.task)
    test_envs = DummyVectorEnv(
        [lambda: gym.make(args.task) for _ in range(args.test_num)])
    # seed
    print("seed")
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    train_envs.seed(args.seed)
    test_envs.seed(args.seed)
    # Q_param = V_param = {"hidden_sizes": [128]}
    # model
    net = Net(args.state_shape, args.action_shape,
              hidden_sizes=args.hidden_sizes, device=args.device,
              # dueling=(Q_param, V_param),
              ).to(args.device)
    optim = torch.optim.Adam(net.parameters(), lr=args.lr)
    policy = DQNPolicy(net, optim, args.gamma, args.n_step,
                       target_update_freq=args.target_update_freq)
    # buffer
    if args.prioritized_replay:
        buf = PrioritizedVectorReplayBuffer(
            args.buffer_size, buffer_num=len(train_envs),
            alpha=args.alpha, beta=args.beta)
    else:
        buf = VectorReplayBuffer(args.buffer_size, buffer_num=len(train_envs))
    # collector
    train_collector = Collector(policy, train_envs, buf, exploration_noise=True)
    test_collector = Collector(policy, test_envs, exploration_noise=True)
    # policy.set_eps(1)
    train_collector.collect(n_step=args.batch_size * args.training_num)
    # log
    log_path = os.path.join(args.logdir, args.task, 'dqn')
    writer = SummaryWriter(log_path)
    logger = BasicLogger(writer)

    def save_fn(policy):
        torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth'))

    def stop_fn(mean_rewards):
        print("mean_rewards", mean_rewards)
        return mean_rewards >= 5000

    def train_fn(epoch, env_step):
        # eps annealing, just a demo
        if env_step <= 10000:
            policy.set_eps(args.eps_train)
        elif env_step <= 50000:
            eps = args.eps_train - (env_step - 10000) / \
                40000 * (0.9 * args.eps_train)
            policy.set_eps(eps)
        else:
            policy.set_eps(0.1 * args.eps_train)

    def test_fn(epoch, env_step):
        policy.set_eps(args.eps_test)

    # trainer
    print("trainer {} ".format("trainer"))
    result = offpolicy_trainer(
        policy, train_collector, test_collector, args.epoch,
        args.step_per_epoch, args.step_per_collect, args.test_num,
        args.batch_size, update_per_step=args.update_per_step, train_fn=train_fn,
        test_fn=test_fn, stop_fn=stop_fn, save_fn=save_fn, logger=logger)
    assert stop_fn(result['best_reward'])

    if __name__ == '__main__':
        pprint.pprint(result)
        # Let's watch its performance!
        env = gym.make(args.task)
        policy.eval()
        policy.set_eps(args.eps_test)
        collector = Collector(policy, env)
        result = collector.collect(n_episode=1, render=args.render)
        rews, lens = result["rews"], result["lens"]
        print(f"Final reward: {rews.mean()}, length: {lens.mean()}")
        # save buffer in pickle format, for imitation learning unittest
        buf = VectorReplayBuffer(args.buffer_size, buffer_num=len(test_envs))
        policy.set_eps(0.2)
        collector = Collector(policy, test_envs, buf, exploration_noise=True)
        result = collector.collect(n_step=args.buffer_size)
        pickle.dump(buf, open(args.save_buffer_name, "wb"))
        print(result["rews"].mean())


def test_pdqn(args=get_args()):
    args.prioritized_replay = True
    args.gamma = .95
    args.seed = 1
    test_dqn(args)


if __name__ == '__main__':
    test_dqn(get_args())

Testing the trained model

import os
import gym
import torch
import pickle
import pprint
import argparse
import numpy as np
from torch.utils.tensorboard import SummaryWriter

from tianshou.policy import DQNPolicy
from tianshou.utils import BasicLogger
from tianshou.env import DummyVectorEnv
from tianshou.utils.net.common import Net
from tianshou.trainer import offpolicy_trainer
from tianshou.data import Collector, VectorReplayBuffer, PrioritizedVectorReplayBuffer
import tianshou as ts

env = gym.make('MsPacman-ram-v0')
state_shape = env.observation_space.shape or env.observation_space.n
action_shape = env.action_space.shape or env.action_space.n
print("args.state_shape {} ,args.action_shape {} ".format(args.state_shape,args.action_shape))
np.random.seed(0519)
torch.manual_seed(0628)net = Net(state_shape, action_shape,hidden_sizes=[512, 256, 128, 128], device='cuda' if torch.cuda.is_available() else 'cpu',# dueling=(Q_param, V_param),).to('cuda' if torch.cuda.is_available() else 'cpu')
optim = torch.optim.Adam(net.parameters(), lr=1e-3)policy = DQNPolicy( net, optim, 0.9, 3, 320)policy.load_state_dict(torch.load('./log/MsPacman-ram-v0/dqn/policy.pth'))for _ in range(100):policy.eval()policy.set_eps(0.05)collector = ts.data.Collector(policy, env, exploration_noise=True)collector.collect(n_episode=1, render=1 / 30)
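If you want to keep the rollouts rather than just watch them, you can wrap the environment in a video recorder before handing it to the collector. A sketch assuming the older gym release this post targets (gym.wrappers.Monitor was later replaced by RecordVideo in newer gym/gymnasium) and an ffmpeg installation; the output directory is made up:

# record evaluation episodes as video files under ./video (hypothetical directory)
record_env = gym.wrappers.Monitor(gym.make('MsPacman-ram-v0'), './video', force=True)
collector = ts.data.Collector(policy, record_env, exploration_noise=True)
collector.collect(n_episode=1)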
