Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning

Abstract

Instability and variability of Deep Reinforcement Learning (DRL) algorithms tend to adversely affect their performance. Averaged-DQN is a simple extension to the DQN algorithm, based on averaging previously learned Q-values estimates, which leads to a more stable training procedure and improved performance by reducing approximation error variance in the target values. To understand the effect of the algorithm, we examine the source of value function estimation errors and provide an analytical comparison within a simplified model. We further present experiments on the Arcade Learning Environment benchmark that demonstrate significantly improved stability and performance due to the proposed extension.

1. Introduction

In Reinforcement Learning (RL) an agent seeks an optimal policy for a sequential decision making problem (Sutton & Barto, 1998). It does so by learning which action is optimal for each environment state. Over the course of time, many algorithms have been introduced for solving RL problems including Q-learning (Watkins & Dayan, 1992), SARSA (Rummery & Niranjan, 1994; Sutton & Barto, 1998), and policy gradient methods (Sutton et al., 1999). These methods are often analyzed in the setup of linear function approximation, where convergence is guaranteed under mild assumptions (Tsitsiklis, 1994; Jaakkola et al., 1994; Tsitsiklis & Van Roy, 1997; Even-Dar & Mansour, 2003). In practice, real-world problems usually involve high-dimensional inputs, forcing linear function approximation methods to rely upon hand engineered features for problem-specific state representation. These problem-specific features diminish the agent's flexibility, and so the need for an expressive and flexible non-linear function approximation emerges. Except for few successful attempts (e.g., TD-gammon, Tesauro (1995)), the combination of non-linear function approximation and RL was considered unstable and was shown to diverge even in simple domains (Boyan & Moore, 1995).

The recent Deep Q-Network (DQN) algorithm (Mnih et al., 2013), was the first to successfully combine a powerful non-linear function approximation technique known as Deep Neural Network (DNN) (LeCun et al., 1998; Krizhevsky et al., 2012) together with the Q-learning algorithm. DQN presented a remarkably flexible and stable algorithm, showing success in the majority of games within the Arcade Learning Environment (ALE) (Bellemare et al., 2013). DQN increased the training stability by breaking the RL problem into sequential supervised learning tasks. To do so, DQN introduces the concept of a target network and uses an Experience Replay buffer (ER) (Lin, 1993).

Following the DQN work, additional modifications and extensions to the basic algorithm further increased training stability. Schaul et al. (2015) suggested a sophisticated ER sampling strategy. Several works extended standard RL exploration techniques to deal with high-dimensional input (Bellemare et al., 2016; Tang et al., 2016; Osband et al., 2016). Mnih et al. (2016) showed that sampling from ER could be replaced with asynchronous updates from parallel environments (which enables the use of on-policy methods). Wang et al. (2015) suggested a network architecture based on the advantage function decomposition (Baird III, 1993).

In this work we address issues that arise from the combination of Q-learning and function approximation. Thrun & Schwartz (1993) were first to investigate one of these issues, which they termed the overestimation phenomenon. The max operator in Q-learning can lead to overestimation of state-action values in the presence of noise. Van Hasselt et al. (2015) suggest the Double-DQN that uses the Double Q-learning estimator (Van Hasselt, 2010) method as a solution to the problem. Additionally, Van Hasselt et al. (2015) showed that Q-learning overestimation does occur in practice (at least in the ALE).

This work suggests a different solution to the overestimation phenomena, named Averaged-DQN (Section 3), based on averaging previously learned Q-values estimates. The averaging reduces the target approximation error variance (Sections 4 and 5) which leads to stability and improved results. Additionally, we provide experimental results on selected games of the Arcade Learning Environment.

We summarize the main contributions of this paper as follows:

• A novel extension to the DQN algorithm which stabilizes training, and improves the attained performance, by averaging over previously learned Q-values.

• Variance analysis that explains some of the DQN problems, and how the proposed extension addresses them.

• Experiments with several ALE games demonstrating the favorable effect of the proposed scheme.

2. Background

In this section we elaborate on relevant RL background, and specifically on the Q-learning algorithm.

2.1. Reinforcement Learning

Value-based methods for solving RL problems encode policies through the use of value functions, which denote the expected discounted cumulative reward from a given state s, following a policy π. Specifically, we are interested in state-action value functions:
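In its standard form, with $\gamma \in [0,1)$ the discount factor and $r_t$ the reward at time $t$:

$$Q^{\pi}(s,a) \;=\; \mathbb{E}^{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0}=s,\ a_{0}=a\right].$$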

2.2. Q-learning

One of the most popular RL algorithms is the Q-learning algorithm (Watkins & Dayan, 1992). This algorithm is based on a simple value iteration update (Bellman, 1957), directly estimating the optimal value function Q*. Tabular Q-learning maintains a table of action-value function estimates and performs updates using the following update rule:

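In its standard form, with learning rate $\alpha$ and observed transition $(s_t, a_t, r_t, s_{t+1})$:

$$Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) + \alpha\left(r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right).$$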

2.3. Deep Q Networks (DQN)

We present in Algorithm 1 a slightly different formulation of the DQN algorithm (Mnih et al., 2013). In iteration i the DQN algorithm solves a supervised learning problem to approximate the action-value function Q(s, a; θ) (line 6). This is an extension of implementing the tabular Q-learning update of Section 2.2 in its function approximation form (Riedmiller, 2005).
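As a concrete (if simplified) illustration of this supervised step, the sketch below computes the DQN regression target from a frozen target network and the corresponding loss in PyTorch. The function and field names (`q_net`, `target_net`, the batch tuple) are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN regression step: fit Q(s, a; theta) to the fixed target
    y = r + gamma * max_a' Q(s', a'; theta^-), where theta^- are the frozen
    target-network parameters (a sketch; the batch layout is an assumption)."""
    states, actions, rewards, next_states, dones = batch

    # Current estimates Q(s, a; theta) for the actions that were taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Targets are computed from the previous (frozen) parameters and are
    # not back-propagated through, which is the target-network idea.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)
```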

Note that in the original implementation (Mnih et al., 2013; 2015), transitions are added to the ER buffer simultaneously with the minimization of the DQN loss (line 6). Using the hyperparameters employed by Mnih et al. (2013; 2015) (detailed for completeness in Appendix E), 1% of the experience transitions in the ER buffer are replaced between target network parameter updates, and 8% are sampled for minimization.

3. Averaged DQN

The Averaged-DQN algorithm (Algorithm 2) is an extension of the DQN algorithm. Averaged-DQN uses the K previously learned Q-value estimates to produce the current action-value estimate (line 5). The Averaged-DQN algorithm stabilizes the training process (see Figure 1) by reducing the variance of the target approximation error, as we elaborate in Section 5. Compared to DQN, the computational effort is K-fold more forward passes through a Q-network while minimizing the DQN loss (line 7); the number of back-propagation updates remains the same as in DQN. Computational cost experiments are provided in Appendix D. The output of the algorithm is the average over the last K previously learned Q-networks.
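A minimal sketch of how the averaging in line 5 could be implemented, assuming a PyTorch Q-network; the class and method names are ours, and details such as snapshot storage are assumptions rather than the authors' implementation. The key point is that only extra forward passes are added: gradients never flow through the averaged target.

```python
import copy
import torch

class AveragedTarget:
    """Holds the K most recently learned Q-network snapshots and averages
    their outputs to form the Averaged-DQN target (an illustrative sketch,
    not the authors' implementation)."""

    def __init__(self, k):
        self.k = k
        self.snapshots = []  # frozen copies of the last K parameter sets

    def push(self, q_net):
        # Called where DQN would copy theta into the target network,
        # i.e. once per target-network update.
        frozen = copy.deepcopy(q_net).eval()
        for p in frozen.parameters():
            p.requires_grad_(False)
        self.snapshots.append(frozen)
        self.snapshots = self.snapshots[-self.k:]

    def q_average(self, states):
        # Q^A(s, a) = (1/K) * sum over the last K networks of Q(s, a; theta_{i-k}).
        # Only forward passes are added; no gradients flow through the target.
        with torch.no_grad():
            stacked = torch.stack([net(states) for net in self.snapshots])
            return stacked.mean(dim=0)

    def targets(self, rewards, next_states, dones, gamma=0.99):
        # Bootstrapped target r + gamma * max_a Q^A(s', a) used in the DQN loss.
        next_q = self.q_average(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q
```

With K = 1 the averaged target coincides with the standard DQN target, consistent with the K = 1 case noted in the caption of Figure 4.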

Figure 1. DQN and Averaged-DQN performance in the Atari game of BREAKOUT. The bold lines are averages over seven independent learning trials. Every 1M frames, a performance test using an ε-greedy policy with ε = 0.05 for 500000 frames was conducted. The shaded area presents one standard deviation. For both DQN and Averaged-DQN the hyperparameters used were taken from Mnih et al. (2015).

In Figures 1 and 2 we can see the performance of Averaged-DQN compared to DQN (and Double-DQN); further experimental results are given in Section 6. We note that recently learned state-action value estimates are likely to be better than older ones, therefore we have also considered a recency-weighted average. In practice, a weighted average scheme did not improve performance and therefore is not presented here.

4. Overestimation and Approximation Errors

Next, we discuss the various types of errors that arise due to the combination of Q-learning and function approximation in the DQN algorithm, and their effect on training stability. We refer to DQN’s performance in the BREAKOUT game in Figure 1. The source of the learning curve variance in DQN’s performance is an occasional sudden drop in the average score that is usually recovered in the next evaluation phase (for another illustration of the variance source see Appendix A). Another phenomenon can be observed in Figure 2, where DQN initially reaches a steady state (after 20 million frames), followed by a gradual deterioration in performance.

For the rest of this section, we list the above mentioned errors, and discuss our hypothesis as to the relations between each error and the instability phenomena depicted in Figures 1 and 2.


Figure 2. DQN, Double-DQN, and Averaged-DQN performance (left), and average value estimates (right) in the Atari game of ASTERIX. The bold lines are averages over seven independent learning trials. The shaded area presents one standard deviation. Every 2M frames, a performance test using an ε-greedy policy with ε = 0.05 for 500000 frames was conducted. The hyperparameters used were taken from Mnih et al. (2015).

The optimality difference can be seen as the error of standard tabular Q-learning; here we address the other errors. We next discuss each error in turn.
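For reference, the decomposition underlying this discussion can be written as follows (a hedged reconstruction: $y^{i}_{s,a}$ denotes the DQN target value and $\hat{y}^{i}_{s,a}$ the corresponding target computed from exact expectations rather than from noisy estimates):

$$Q(s,a;\theta_i) - Q^{*}(s,a) \;=\; \underbrace{\left(Q(s,a;\theta_i) - y^{i}_{s,a}\right)}_{\text{target approximation error}} \;+\; \underbrace{\left(y^{i}_{s,a} - \hat{y}^{i}_{s,a}\right)}_{\text{overestimation error}} \;+\; \underbrace{\left(\hat{y}^{i}_{s,a} - Q^{*}(s,a)\right)}_{\text{optimality difference}}.$$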

4.1. Target Approximation Error (TAE)

We hypothesize that the variability in DQN’s performance in Figure 1, that was discussed at the start of this section, is related to deviating from a steady-state policy induced by the TAE.

4.2. Overestimation Error

The overestimation error is different in its nature from the TAE since it presents a positive bias that can cause asymptotically sub-optimal policies, as was shown by Thrun & Schwartz (1993), and later by Van Hasselt et al. (2015) in the ALE environment. Note that a uniform bias in the action-value function will not cause a change in the induced policy. Unfortunately, the overestimation bias is uneven and is bigger in states where the Q-values are similar for the different actions, or in states which are the start of a long trajectory (as we discuss in Section 5 on accumulation of TAE variance).
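For context, the upper bound referred to below is the classical Thrun & Schwartz (1993) argument, paraphrased here under their assumptions: if the errors $Z_{s,a}$ in the estimated action values of the $n$ available actions are independent and uniformly distributed on $[-\epsilon, \epsilon]$, then

$$\mathbb{E}\!\left[\max_{a} Z_{s,a}\right] \;=\; \epsilon\,\frac{n-1}{n+1},$$

so the max operator introduces an expected overestimation of up to $\gamma\epsilon\frac{n-1}{n+1}$ in the bootstrapped target.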

Following from the above mentioned overestimation upper bound, the magnitude of the bias is controlled by the variance of the TAE.

The Double Q-learning and its DQN implementation (Double-DQN) (Van Hasselt et al., 2015; Van Hasselt, 2010) is one possible approach to tackle the overestimation problem, which replaces the positive bias with a negative one. Another possible remedy to the adverse effects of this error is to directly reduce the variance of the TAE, as in our proposed scheme (Section 5).

In Figure 2 we repeated the experiment presented in Van Hasselt et al. (2015) (along with the application of Averaged-DQN). This experiment is discussed in Van Hasselt et al. (2015) as an example of overestimation that leads to asymptotically sub-optimal policies. Since Averaged-DQN reduces the TAE variance, this experiment supports the hypothesis that the main cause for overestimation in DQN is the TAE variance.

5. TAE Variance Reduction

5.1. DQN Variance

We assume the statistical model mentioned at the start of this section. Consider a unidirectional Markov Decision Process (MDP) as in Figure 3, where the agent starts at state s_0, state s_{M-1} is a terminal state, and the reward in any state is equal to zero.

Employing DQN on this MDP model, we get that for i > M:
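A plausible form of the expression this introduces, assuming (as in the statistical model referred to above) that the TAE at each state $s_m$ is independent with zero mean and variance $\sigma^{2}_{s_m}$:

$$\mathrm{Var}\!\left[Q(s_0, a; \theta_i)\right] \;=\; \sum_{m=0}^{M-1} \gamma^{2m}\, \sigma^{2}_{s_m},$$

that is, the TAEs of all states along the M-step trajectory accumulate at $s_0$ through the discounted bootstrap.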

The above example gives intuition about the behavior of the TAE variance in DQN. The TAE is accumulated over the past DQN iterations on the updates trajectory. Accumulation of TAE errors results in bigger variance with its associated adverse effect, as was discussed in Section 4.

5.2. Ensemble DQN Variance

We consider two approaches for TAE variance reduction. The first one is the Averaged-DQN and the second we term Ensemble-DQN. We start with Ensemble-DQN, which is a straightforward way to obtain a 1/K variance reduction, with a computational effort of K-fold learning problems, compared to DQN. Ensemble-DQN (Algorithm 3) solves K DQN losses in parallel, then averages over the resulting Q-value estimates.
For Ensemble-DQN on the unidirectional MDP in Figure 3, we get for i > M:
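Under the same assumptions, averaging K independently learned estimates divides each variance term by K, consistent with the 1/K reduction stated above (a reconstruction, not the paper's exact derivation):

$$\mathrm{Var}\!\left[Q^{E}_{i}(s_0, a)\right] \;=\; \frac{1}{K}\sum_{m=0}^{M-1} \gamma^{2m}\, \sigma^{2}_{s_m}.$$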

5.3. Averaged DQN Variance

We continue with Averaged-DQN, and calculate the variance in state s_0 for the unidirectional MDP in Figure 3. We get that for i > KM:
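The closed-form expression depends on how the TAEs of the last K iterations mix along the trajectory; we state it here only as a bound, consistent with the comparison made in the next sentence (a hedged reconstruction rather than the paper's exact formula):

$$\mathrm{Var}\!\left[Q^{A}_{i}(s_0, a)\right] \;\le\; \frac{1}{K}\sum_{m=0}^{M-1} \gamma^{2m}\, \sigma^{2}_{s_m},$$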

meaning that Averaged-DQN is theoretically more efficient in TAE variance reduction than Ensemble-DQN, and at least K times better than DQN. The intuition here is that Averaged-DQN averages over TAE averages, which are the value estimates of the next states.

6. Experiments

The experiments were designed to address the following questions:

  • How does the number K of averaged target networks affect the error in value estimates, and in particular the overestimation error?
  • How does the averaging affect the quality of the learned policies?

To that end, we ran Averaged-DQN and DQN on the ALE benchmark. Additionally, we ran Averaged-DQN, Ensemble-DQN, and DQN on a Gridworld toy problem where the optimal value function can be computed exactly.

6.1. Arcade Learning Environment (ALE)

To evaluate Averaged-DQN, we adopt the typical RL methodology where agent performance is measured at the end of training. We refer the reader to Liang et al. (2016) for further discussion about DQN evaluation methods on the ALE benchmark. The hyperparameters used were taken from Mnih et al. (2015), and are presented for completeness in Appendix E. DQN code was taken from McGill University RLLAB, and is available online (together with the Averaged-DQN implementation).

We have evaluated the Averaged-DQN algorithm on three Atari games from the Arcade Learning Environment (Bellemare et al., 2013). The game of BREAKOUT was selected due to its popularity and the relative ease with which DQN reaches a steady-state policy. In contrast, the game of SEAQUEST was selected due to its relative complexity, and the significant improvement in performance obtained by other DQN variants (e.g., Schaul et al. (2015); Wang et al. (2015)). Finally, the game of ASTERIX was presented in Van Hasselt et al. (2015) as an example of overestimation in DQN that leads to divergence.

As can be seen in Figure 4 and in Table 1 for all three games, increasing the number of averaged networks in Averaged-DQN results in lower average value estimates, better-performing policies, and less variability between the runs of independent learning trials. For the game of ASTERIX, we see, similarly to Van Hasselt et al. (2015), that the divergence of DQN can be prevented by averaging.

Overall, the results suggest that in practice Averaged-DQN reduces the TAE variance, which leads to smaller overestimation, stabilized learning curves and significantly improved performance.

6.2. Gridworld

The Gridworld problem (Figure 5) is a common RL benchmark (e.g., Boyan & Moore (1995)). As opposed to the ALE, Gridworld has a smaller state space that allows the ER buffer to contain all possible state-action pairs. Additionally, it allows the optimal value function Q* to be accurately computed.

For the experiments, we have used Averaged-DQN, and Ensemble-DQN, with an ER buffer containing all possible state-action pairs. The network architecture used was a small fully connected neural network with one hidden layer of 80 neurons. For minimization of the DQN loss, the ADAM optimizer (Kingma & Ba, 2014) was used on 100 mini-batches of 32 samples per target network parameter update in the first experiment, and 300 mini-batches in the second.
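For illustration, a minimal sketch of such a network and optimizer in PyTorch; the input and output dimensions and all names are assumptions made for this example, and only the 80-unit hidden layer and the use of ADAM follow from the text.

```python
import torch
import torch.nn as nn

# Small fully connected Q-network with one hidden layer of 80 neurons, as
# stated above. The input and output sizes are placeholders chosen only for
# this illustration of the Gridworld setup.
STATE_DIM, NUM_ACTIONS = 25, 4  # assumed values, not from the paper

q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 80),
    nn.ReLU(),
    nn.Linear(80, NUM_ACTIONS),
)

# ADAM optimizer minimizing the DQN loss over mini-batches of 32 samples,
# for 100 (or 300) mini-batches per target-network update, per the text.
optimizer = torch.optim.Adam(q_net.parameters())
```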

Figure 4. The top row shows Averaged-DQN performance for the different number K of averaged networks on three Atari games. For K = 1, Averaged-DQN is reduced to DQN. The bold lines are averaged over seven independent learning trials. Every 2M frames, a performance test using an ε-greedy policy with ε = 0.05 for 500000 frames was conducted. The shaded area presents one standard deviation. The bottom row shows the average value estimates for the three games. It can be seen that as the number of averaged networks is increased, overestimation of the values is reduced, performance improves, and less variability is observed. The hyperparameters used were taken from Mnih et al. (2015).

6.2.1. ENVIRONMENT SETUP

Figure 5. Gridworld problem. The agent starts at the left-bottom of the grid. In the upper-right corner, a reward of +1 is obtained.

6.2.2. OVERESTIMATION

In Figure 6 it can be seen that increasing the number K of averaged target networks eventually leads to reduced overestimation. Also, more averaged target networks seem to reduce the overshoot of the values, and lead to smoother and more consistent convergence.

6.2.3. AVERAGED VERSUS ENSEMBLE DQN

In Figure 7, it can be seen that, as was predicted by the analysis in Section 5, Ensemble-DQN is also inferior to Averaged-DQN regarding variance reduction, and as a consequence overestimates the values far more. We note that Ensemble-DQN was not implemented for the ALE experiments due to its demanding computational effort, and the empirical evidence that was already obtained in this simple Gridworld domain.

Figure 6. Averaged-DQN average predicted value in Gridworld. Increasing the number K of averaged target networks leads to a faster convergence with less overestimation (positive bias). The bold lines are averages over 40 independent learning trials, and the shaded area presents one standard deviation. In the figure, A, B, C, and D mark the average overestimation of DQN and of Averaged-DQN with K = 5, 10, and 20, respectively.

Figure 7. Averaged-DQN and Ensemble-DQN predicted value in Gridworld. Averaging of past learned values is more beneficial than learning in parallel. The bold lines are averages over 20 independent learning trials, where the shaded area presents one standard deviation.

7. Discussion and Future Directions

In this work, we have presented the Averaged-DQN algorithm, an extension to DQN that stabilizes training and improves performance by efficient TAE variance reduction. We have shown both in theory and in practice that the proposed scheme is superior in TAE variance reduction, compared to a straightforward but computationally demanding approach such as Ensemble-DQN (Algorithm 3). We have demonstrated in several games of Atari that increasing the number K of averaged target networks leads to better policies while reducing overestimation.

Averaged-DQN is a simple extension that can be easily integrated with other DQN variants such as Schaul et al. (2015); Van Hasselt et al. (2015); Wang et al. (2015); Bellemare et al. (2016); He et al. (2016). Indeed, it would be of interest to study the added value of averaging when combined with these variants. Also, since Averaged-DQN has a variance reduction effect on the learning curve, a more systematic comparison between the different variants can be facilitated, as discussed in Liang et al. (2016).

In future work, we may dynamically learn when and how many networks to average for best results. One simple suggestion may be to correlate the number of networks with the state TD-error, similarly to Schaul et al. (2015). Finally, incorporating averaging techniques similar to Averaged-DQN within on-policy methods such as SARSA and Actor-Critic methods (Mnih et al., 2016) can further stabilize these algorithms.

References

Bryson, Arthur E and Ho, Yu Chi. Applied Optimal Control: Optimization, Estimation, and Control. Hemisphere Publishing, 1975.

Baird III, Leemon C. Advantage updating. Technical report, DTIC Document, 1993.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, Marc G, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868, 2016.

Bellman, Richard. A Markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957.

Boyan, Justin and Moore, Andrew W. Generalization in reinforcement learning: Safely approximating the value function. Advances in neural information processing systems, pp. 369–376, 1995.

Even-Dar, Eyal and Mansour, Yishay. Learning rates for q-learning. Journal of Machine Learning Research, 5 (Dec):1–25, 2003.

He, Frank S., Liu, Yang, Schwing, Alexander G., and Peng, Jian. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.

Jaakkola, Tommi, Jordan, Michael I, and Singh, Satinder P. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv: 1412.6980, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in NIPS, pp. 1097–1105, 2012.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Liang, Yitao, Machado, Marlos C, Talvitie, Erik, and Bowling, Michael. State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 485–493, 2016.

Lin, Long-Ji. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.

Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped DQN. arXiv preprint arXiv:1602.04621, 2016.

Riedmiller, Martin. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.

Rummery, Gavin A and Niranjan, Mahesan. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering, 1994.

Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press Cambridge, 1998.

Sutton, Richard S, McAllester, David A, Singh, Satinder P, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 99, pp. 1057–1063, 1999.

Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. #Exploration: A study of count-based exploration for deep reinforcement learning. arXiv preprint arXiv:1611.04717, 2016.

Tesauro, Gerald. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

Thrun, Sebastian and Schwartz, Anton. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum, 1993.

Tsitsiklis, John N. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.

Tsitsiklis, John N and Van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

Van Hasselt, Hado. Double Q-learning. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 2613–2621. 2010.

Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.

Wang, Ziyu, de Freitas, Nando, and Lanctot, Marc. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Watkins, Christopher JCH and Dayan, Peter. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
