OpenAI Five Self-Play

Contents

  • Blog
    • OpenAI Five
    • OpenAI Five Benchmark
    • OpenAI Five Benchmark: Results
    • OpenAI Five Defeats Dota 2 World Champions
  • Paper
    • Dota 2 with Large Scale Deep Reinforcement Learning

Blog

OpenAI Five

References: /
Challenges posed by Dota 2:

  • long time horizons
  • partially observed state
  • high-dimensional, continuous action space
  • high-dimensional, continuous observation space

Approach

  • massively-scaled PPO + self-play
  • Hardware resources:
  • Long-horizon decision making
    It used to be assumed that long-horizon decision making required special techniques such as hierarchical reinforcement learning, but the post argues that we simply have not trusted existing algorithms enough: only at sufficient scale, and with a reasonable exploration scheme, do they show their real value.
  1. Tuning the discount factor gamma changes the half-life over which future rewards influence the current decision (see the half-life sketch after this list).
  2. Players who scrimmaged against the bot noticed that its last-hitting was weak, which indirectly confirms that the long-term strategy of pushing towers was bought by sacrificing short-term reward.
  • Model architecture
    An LSTM is used to handle the partially observed (POMDP) nature of the game (a toy recurrent policy is sketched after this list).
  • Exploration
  1. self-play: OpenAI Five learns from self-play (starting from random weights), which provides a natural curriculum for exploring the environment. To avoid “strategy collapse”, the agent trains 80% of its games against itself and the other 20% against its past selves. (A minimal opponent-sampling sketch follows this list.)
  2. randomization: OpenAI Five uses the randomizations we wrote for our 1v1 bot. It also uses a new “lane assignment” one. At the beginning of each training game, we randomly “assign” each hero to some subset of lanes and penalize it for straying from those lanes until a randomly-chosen time in the game. (Domain randomization, borrowed from transfer methods, improves the policy's generalization; the lane-assignment task feels similar to the annealed exploration reward in competitive self-play, as both inject expert knowledge to speed up convergence and steer the policy.)
  3. reward: We postprocess each agent’s reward by subtracting the other team’s average reward to prevent the agents from finding positive-sum situations. (The reward is the difference between our own and the enemy team’s reward, i.e. a zero-sum game; see the reward-shaping sketch after this list.)
  • Cooperation
    OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training. (Expert knowledge baked into the reward; this blend is also covered in the reward-shaping sketch below.)
  • Training framework


(From a figure in the blog) The latencies for synchronizing 58 MB of data (the size of OpenAI Five’s parameters) across different numbers of GPUs are low enough to be largely masked by the GPU computation that runs in parallel with them.
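
A minimal sketch of the gamma/half-life relationship mentioned in the list above: with a per-step discount gamma, the weight of a reward t steps in the future is gamma**t, so the half-life is the t at which that weight reaches 0.5. The 0.133 s control interval and the two example gamma values reflect my reading of the blog and paper and should be treated as illustrative, not authoritative.

```python
import math

def half_life_seconds(gamma: float, dt: float = 0.133) -> float:
    """Time horizon at which a future reward's weight gamma**t falls to 0.5.

    dt is the control interval in seconds (assumed ~0.133 s, i.e. acting on
    every 4th frame of a 30 fps game).
    """
    steps = math.log(0.5) / math.log(gamma)
    return steps * dt

# Annealing gamma upward stretches the effective horizon from roughly
# tens of seconds to several minutes.
for gamma in (0.998, 0.9997):
    print(f"gamma={gamma}: half-life ≈ {half_life_seconds(gamma):.0f} s")
```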
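
A toy recurrent policy illustrating the "LSTM for partial observability" point: the LSTM hidden state summarizes everything the agent has observed so far, so the policy acts on more than the current frame. Module names and sizes here are made up for the sketch; the real network is reportedly built around a single large (4096-unit) LSTM with many observation encoders and action heads.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Minimal LSTM policy/value network for a partially observed environment."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, n_actions)  # action logits
        self.value_head = nn.Linear(hidden, 1)           # baseline for PPO

    def forward(self, obs_seq: torch.Tensor, state=None):
        # obs_seq: (batch, time, obs_dim). `state` is the (h, c) pair carried
        # over from the previous slice of the game, which is what gives the
        # agent memory across a long, partially observed match.
        x = torch.relu(self.encoder(obs_seq))
        x, state = self.lstm(x, state)
        return self.action_head(x), self.value_head(x), state
```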
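
The 80/20 self-play mix can be written as a tiny opponent sampler. `OpponentPool` and its fields are hypothetical names used only for illustration; the real system reportedly weights past versions by a quality score rather than sampling them uniformly, so the uniform choice below is a simplification.

```python
import random

class OpponentPool:
    """Samples opponents for self-play training games."""

    def __init__(self, p_latest: float = 0.8):
        self.p_latest = p_latest      # fraction of games vs. the current self
        self.past_checkpoints = []    # snapshots of earlier policies

    def add_checkpoint(self, params) -> None:
        self.past_checkpoints.append(params)

    def sample_opponent(self, current_params):
        # Play mostly against the current policy, but sometimes against a past
        # self so the strategy does not collapse to beating only its newest form.
        if not self.past_checkpoints or random.random() < self.p_latest:
            return current_params
        return random.choice(self.past_checkpoints)
```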
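
The zero-sum post-processing and the "team spirit" blend fit into one small sketch. The function name, array layout, and the ordering of the two steps are my assumptions; only the two transformations themselves (subtracting the opposing team's average reward, and interpolating between a hero's own reward and the team average with an annealed coefficient) come from the text above.

```python
import numpy as np

def shape_rewards(own: np.ndarray, enemy: np.ndarray, team_spirit: float) -> np.ndarray:
    """Post-process the per-hero rewards of one team for a single timestep.

    own, enemy: raw rewards of the five heroes on each team.
    team_spirit: annealed from 0 (purely selfish) to 1 (only the team average matters).
    """
    # Zero-sum: subtract the other team's average reward so both teams cannot
    # profit from the same situation.
    zero_sum = own - enemy.mean()
    # Team spirit: blend each hero's own reward with the team's average reward.
    return (1.0 - team_spirit) * zero_sum + team_spirit * zero_sum.mean()

# Early in training (team_spirit=0) each hero keeps its own reward; by the end
# (team_spirit=1) all five heroes share the team average.
own, enemy = np.array([1.0, 0.0, 0.0, 0.0, 0.0]), np.zeros(5)
print(shape_rewards(own, enemy, team_spirit=0.0))  # [1. 0. 0. 0. 0.]
print(shape_rewards(own, enemy, team_spirit=1.0))  # [0.2 0.2 0.2 0.2 0.2]
```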

OpenAI Five Benchmark

References: /

OpenAI Five Benchmark: Results

References: /
transfer

  • How parameters are initialized when a new version is iterated: this version of OpenAI Five contains parameters that have been training since June 9th across six major system revisions. Each revision was initialized with parameters from the previous one.
  • How parameters are carried over when the network architecture changes: We invested heavily in “surgery” tooling which allows us to map old parameters to a new network architecture. For example, when we first trained warding, we shared a single action head for determining where to move and where to place a ward. But Five would often drop wards seemingly in the direction it was trying to go, and we hypothesized it was allocating its capacity primarily to movement. Our tooling let us split the head into two clones initialized with the same parameters. (A toy version of this head split is sketched below.)
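
A toy version of the head-splitting "surgery" described above, using PyTorch. This is only a sketch of the idea, not OpenAI's tooling: one shared action head is replaced by two clones initialized with identical weights, so the network's behaviour is unchanged at the moment of surgery and the two heads are free to specialize afterwards (one for movement, one for ward placement).

```python
import torch.nn as nn

def split_shared_head(old_head: nn.Linear) -> tuple[nn.Linear, nn.Linear]:
    """Replace one shared action head with two identically initialized clones."""
    move_head = nn.Linear(old_head.in_features, old_head.out_features)
    ward_head = nn.Linear(old_head.in_features, old_head.out_features)
    # Copy the old parameters into both new heads so the policy's outputs are
    # identical immediately after surgery; further training lets them diverge.
    move_head.load_state_dict(old_head.state_dict())
    ward_head.load_state_dict(old_head.state_dict())
    return move_head, ward_head
```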

OpenAI Five Defeats Dota 2 World Champions

  • Why Dota?
    We were expecting to need sophisticated algorithmic ideas, such as hierarchical reinforcement learning, but we were surprised by what we found: the fundamental improvement we needed for this problem was scale. Achieving and utilizing that scale wasn’t easy and was the bulk of our research effort! (With a large enough training scale, the existing algorithms simply work.)
    The striking capability of today’s RL algorithms comes from enormous amounts of sample experience; the next challenge will be reducing how many samples are needed.
  • compute
    In total, the current version of OpenAI Five has consumed 800 petaflop/s-days and experienced about 45,000 years of Dota self-play over 10 realtime months (up from about 10,000 years over 1.5 realtime months as of The International), for an average of 250 years of simulated experience per day. The Finals version of OpenAI Five has a 99.9% winrate versus the TI version.[2]
  • transfer learning
  • more heroes
    After the hero pool was expanded, performance climbed back to about the 95% mark, after which growth was slow but still continuing. There are many possible reasons: the model may lack capacity, a better matchmaking mechanism may be needed, or the new heroes may simply need more training time to reach the proficiency of the old ones.

Paper

Dota 2 with Large Scale Deep Reinforcement Learning

1. Introduction

  • The challenges of Dota 2
  • The key ingredient in solving this complex environment was to scale existing reinforcement
    learning systems to unprecedented levels, utilizing thousands of GPUs over multiple months.
  • One challenge we faced in training was that the environment and code continually changed as
    our project progressed. In order to train without restarting from the beginning after each change,
    we developed a collection of tools to resume training with minimal loss in performance which we
    call surgery.

2. Dota 2
3. Training system

  • playing Dota 2 using AI
  • optimizing the policy
    zero-sum reward
    PPO (the clipped surrogate objective is recalled at the end of this outline)
    central, shared LSTM block

4. Continual Transfer via Surgery
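
For reference, the clipped surrogate objective of PPO that the paper scales up (standard notation from the PPO paper: probability ratio $r_t(\theta)$, advantage estimate $\hat{A}_t$, clipping range $\epsilon$):

$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$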
