Deriving Conservative Policy Iteration (CPI) in Reinforcement Learning
- Preface
- Greedy policy
- Conservative Policy Iteration
- Lemma 1 (Performance difference lemma)
- An alternative form of Lemma 1
- Lemma 2
- Monotonic improvement
Preface
I have recently been studying TRPO, and found that it builds on many, many earlier results, including the policy gradient theorem (PGT), the natural policy gradient (NPG), and conservative policy iteration (CPI). Hence the recent surge of posts on this blog. For PGT and NPG, see my other two posts:
PGT.
NPG.
The main contribution of CPI is a mixed-policy update rule, together with a proof that updating the policy this way makes the value function increase monotonically. Intuitively, if the value function never decreases, then in theory the resulting policies keep getting better.
In practice, the CPI algorithm is rarely applied directly in code, because the quantities appearing in its conclusions are unknown and hard to determine; CPI is better seen as a way of thinking about a class of RL problems. More than a decade later, TRPO became a prominent example of this line of thought applied successfully.
Greedy policy
The greedy policy is a very common update rule: in each iteration, compute the advantage function $A^{\pi_{old}}(s,a)$, then pick the policy $\pi$ that maximizes it as $\pi_{new}$. Simple and blunt. (Sometimes $Q^{\pi_{old}}(s,a)$ is used instead, which is essentially the same.) That is:

$$\pi_{new}=\arg\max_{\pi\in\Pi}{\mathbb{E}_{s\sim d_{\mu}^{\pi}}{\left[A^{\pi}(s,a)\right]}}$$
But this creates a problem: if a single improvement step is large, the trajectories induced by $\pi_{old}$ and $\pi_{new}$ may differ substantially. Since the value function has not converged at this point (learning may have only just begun), the policy can "bounce back and forth" within some region and simply refuse to converge (>_<). This is the original motivation for the mixed-policy update.
Conservative Policy Iteration
CPI turns the policy update into a weighted average of the old policy and an intermediate greedy policy:

$$\left\{ \begin{aligned} & \pi'=\arg\max_{\pi\in\Pi}{\mathbb{E}_{s\sim d_{\mu}^{\pi_{old}}}{\left[A^{\pi}(s,a)\right]}} \\ & \pi_{new}=(1-\alpha)\pi+\alpha\pi' \end{aligned} \right.$$
Writing down this update is easy; in a sense, that is the whole algorithm. The real question is how to show that it actually improves the policy. Three lemmas follow.
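As a concrete illustration of the update, here is a minimal tabular sketch. It assumes policies stored as `(num_states, num_actions)` probability arrays, and uses a per-state greedy argmax over a given advantage table in place of the inner maximization; all names and values are illustrative:

```python
import numpy as np

def conservative_update(pi_old, advantage, alpha):
    """One CPI step: pi_new = (1 - alpha) * pi_old + alpha * pi_prime,
    where pi_prime greedily picks the action maximizing advantage(s, .)."""
    num_states, num_actions = pi_old.shape
    pi_prime = np.zeros_like(pi_old)
    pi_prime[np.arange(num_states), advantage.argmax(axis=1)] = 1.0
    return (1.0 - alpha) * pi_old + alpha * pi_prime

rng = np.random.default_rng(0)
pi_old = rng.random((5, 3))
pi_old /= pi_old.sum(axis=1, keepdims=True)      # normalize rows
advantage = rng.standard_normal((5, 3))          # stand-in advantage table

pi_new = conservative_update(pi_old, advantage, alpha=0.1)
# pi_new stays a valid distribution and moves at most alpha away from pi_old
print(np.allclose(pi_new.sum(axis=1), 1.0))
print(np.abs(pi_new - pi_old).max() <= 0.1 + 1e-12)
```

The small $\alpha$ is exactly what keeps consecutive policies close, avoiding the oscillation described above.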
Lemma 1 (Performance difference lemma)
For two policies $\pi$ and $\pi'$, and the state distributions $d_{\mu}^{\pi}$ and $d_{\mu}^{\pi'}$ they induce,

$$\left\{ \begin{aligned} & V^{\pi'}(s_0)-V^{\pi}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi'}}\left[A^{\pi}(s,\pi'(s))\right] \\ & V^{\pi}(s_0)-V^{\pi'}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi}}\left[A^{\pi'}(s,\pi(s))\right] \end{aligned} \right.$$
where $\mu$ is the initial state distribution and $A$ is the advantage function.
Proof. We first show $V^{\pi}(s_0)-V^{\pi'}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi}}\left[A^{\pi'}(s,\pi(s))\right]$.
$$\begin{aligned} V^{\pi}(s_0)-V^{\pi'}(s_0) & = V^{\pi}(s_0) -\mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}V^{\pi'}(s_1)\right] + \mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}V^{\pi'}(s_1)\right] - V^{\pi'}(s_0)\\ &=\mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)\right]\right] -\mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi'}(s_1)\right]\right] \\ &\quad+ \mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi'}(s_1)\right]\right] - V^{\pi'}(s_0) \\ &= \gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right]+ \mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi'}(s_1)\right]\right] - V^{\pi'}(s_0) \\ & = \gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right] + \mathbb{E}_{a_0\sim\pi}\left[Q^{\pi'}(s_0,a_0) - V^{\pi'}(s_0)\right]\\ &= \textcolor{red}{\gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right]} + \mathbb{E}_{a_0\sim\pi}A^{\pi'}(s_0,a_0) \end{aligned}$$
The bracketed expression in the red term has the same form as the left-hand side we started from, so we keep expanding it:
$$\begin{aligned} \gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right] & = \gamma\mathbb{E}_{s_1\sim P_1^{\pi}(\cdot|s_0)}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right] \\ & = \gamma\mathbb{E}_{s_1\sim P_1^{\pi}(\cdot|s_0)}\left[\gamma\mathbb{E}_{a_1\sim\pi}\mathbb{E}_{s_2\sim P_{s_1}^{a_1}}\left[V^{\pi}(s_2)-V^{\pi'}(s_2)\right]+\mathbb{E}_{a_1\sim\pi}A^{\pi'}(s_1,a_1)\right] \\ &= \gamma^2\mathbb{E}_{s_2\sim P_2^{\pi}(\cdot|s_0)}\left[V^{\pi}(s_2)-V^{\pi'}(s_2)\right]+\gamma\mathbb{E}_{s_1\sim P_1^{\pi}(\cdot|s_0)}\mathbb{E}_{a_1\sim \pi}A^{\pi'}(s_1,a_1) \end{aligned}$$
Unrolling the recursion in the same way out to infinitely many terms yields
$$\begin{aligned} V^{\pi}(s_0)-V^{\pi'}(s_0) &= \sum_{t=0}^{\infty}\gamma^t \mathbb{E}_{s,a\sim\pi}A^{\pi'}(s,a)\\ &=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi}}\left[A^{\pi'}(s,\pi(s))\right] \end{aligned}$$
The proof of the second identity is exactly the same: the two expressions are symmetric, so it suffices to swap $\pi'$ and $\pi$.
Lemma 1 also has another form. (This is the annoying part: the papers call it obvious, while different sources use different symbols and definitions, or notation that is actually equivalent but looks different >_<.)
An alternative form of Lemma 1
$$\left\{ \begin{aligned} & V^{\pi'}(s_0)-V^{\pi}(s_0)=\sum_sd^{\pi'}(s)\sum_{a}{\pi'(s,a)A^{\pi}(s,a)} \\ & V^{\pi}(s_0)-V^{\pi'}(s_0)=\sum_sd^{\pi}(s)\sum_{a}{\pi(s,a)A^{\pi'}(s,a)} \end{aligned} \right.$$
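This alternative form can be verified numerically on a small random MDP. The sketch below assumes the unnormalized discounted visitation convention $d^{\pi}(s)=\sum_t\gamma^t\Pr(s_t=s)$ with $s_0\sim\mu$ (conventions differ across sources, as noted above) and compares both sides computed exactly via linear solves:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS))            # transition kernel P[s, a, s']
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))                # reward r[s, a]
mu = np.full(nS, 1.0 / nS)              # initial state distribution

def transition_matrix(pi):
    return np.einsum('sa,san->sn', pi, P)

def value(pi):                          # exact V^pi via (I - gamma*P_pi) V = r_pi
    return np.linalg.solve(np.eye(nS) - gamma * transition_matrix(pi),
                           (pi * r).sum(axis=1))

def visitation(pi):                     # unnormalized discounted d^pi from mu
    return np.linalg.solve(np.eye(nS) - gamma * transition_matrix(pi).T, mu)

def advantage(pi):
    V = value(pi)
    return r + gamma * P @ V - V[:, None]

pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
pi2 = rng.random((nS, nA)); pi2 /= pi2.sum(axis=1, keepdims=True)

lhs = mu @ (value(pi2) - value(pi))
rhs = visitation(pi2) @ (pi2 * advantage(pi)).sum(axis=1)
print(abs(lhs - rhs))  # zero up to floating-point error
```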
Lemma 2
For the mixed-policy update,
$$V^{\pi_{new}}- V^{\pi}\geq\alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma(1-\alpha)}$$
where

$$A_{\pi}(\pi')\equiv\sum_sd^{\pi}(s)\sum_a\pi'(s,a)A^{\pi}(s,a),\qquad \epsilon\equiv\frac{1}{1-\gamma}\left[ \max_s\sum_a\pi'(s,a)A^{\pi}(s,a) \right]$$
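Before proving the bound, it can be sanity-checked numerically. The sketch below builds a random tabular MDP, takes $\pi'$ to be the per-state greedy policy (the case the derivation covers), and evaluates both sides exactly; the unnormalized visitation convention from the previous check is again an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, alpha = 4, 3, 0.9, 0.2

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
mu = np.full(nS, 1.0 / nS)

def transition_matrix(pi):
    return np.einsum('sa,san->sn', pi, P)

def value(pi):
    return np.linalg.solve(np.eye(nS) - gamma * transition_matrix(pi),
                           (pi * r).sum(axis=1))

def visitation(pi):  # unnormalized discounted visitation from mu
    return np.linalg.solve(np.eye(nS) - gamma * transition_matrix(pi).T, mu)

pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
V = value(pi)
A = r + gamma * P @ V - V[:, None]            # A^pi(s, a)

pi_prime = np.zeros_like(pi)                  # greedy w.r.t. A^pi
pi_prime[np.arange(nS), A.argmax(axis=1)] = 1.0
pi_new = (1 - alpha) * pi + alpha * pi_prime  # conservative mixture

adv_pp = (pi_prime * A).sum(axis=1)           # sum_a pi'(s,a) A^pi(s,a)
A_pi_prime = visitation(pi) @ adv_pp          # A_pi(pi')
eps = adv_pp.max() / (1 - gamma)

lhs = mu @ (value(pi_new) - V)
bound = alpha * A_pi_prime - 2 * alpha**2 * gamma * eps / (1 - gamma * (1 - alpha))
print(lhs >= bound)  # actual improvement dominates the Lemma 2 lower bound
```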
Proof
Here, view $\pi_{new}$ as a function of $\alpha$: when $\alpha=0$, $\pi_{new}=\pi$; when $\alpha=1$, $\pi_{new}=\pi'$. We evaluate the derivative of $V^{\pi_{new}}$ with respect to $\alpha$ at $\alpha=0$:
$$\begin{aligned} \nabla_{\alpha}V^{\pi_{new}}|_{\alpha=0} &= \left.\sum_sd^{\pi_{new}}(s)\sum_a\nabla_{\alpha}\pi_{new}A^{\pi_{new}}(s,a)\right|_{\alpha=0} \\ &= \left.\sum_sd^{\pi_{new}}(s)\sum_a\left(\pi'-\pi\right)A^{\pi_{new}}(s,a)\right|_{\alpha=0}\\ &= \sum_sd^{\pi}(s)\sum_a\left(\pi'-\pi\right)A^{\pi}(s,a)\\ &= \sum_sd^{\pi}(s)\sum_a\pi'A^{\pi}(s,a)-\sum_sd^{\pi}(s)\sum_a \pi A^{\pi}(s,a)\\ &= \sum_sd^{\pi}(s)\sum_a\pi'A^{\pi}(s,a)\\ &= A_{\pi}(\pi') \end{aligned}$$

(The term $\sum_a\pi A^{\pi}(s,a)$ drops out because the advantage function has zero mean under its own policy.)
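This gradient identity can be confirmed by a finite-difference check on a random MDP, a sketch under the same unnormalized visitation convention used earlier; the step size `h` is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
mu = np.full(nS, 1.0 / nS)

def transition_matrix(pi):
    return np.einsum('sa,san->sn', pi, P)

def value(pi):
    return np.linalg.solve(np.eye(nS) - gamma * transition_matrix(pi),
                           (pi * r).sum(axis=1))

def visitation(pi):
    return np.linalg.solve(np.eye(nS) - gamma * transition_matrix(pi).T, mu)

pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
pi_prime = rng.random((nS, nA)); pi_prime /= pi_prime.sum(axis=1, keepdims=True)

V = value(pi)
A = r + gamma * P @ V - V[:, None]
analytic = visitation(pi) @ (pi_prime * A).sum(axis=1)   # A_pi(pi')

h = 1e-6                                                 # finite-difference step
numeric = (mu @ value((1 - h) * pi + h * pi_prime) - mu @ V) / h
print(abs(analytic - numeric))  # small: agrees up to O(h)
```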
Similarly, $A_{\pi}(\pi_{new})=\alpha A_{\pi}(\pi')$. Taylor-expanding $V^{\pi_{new}}$ around $\alpha=0$ (i.e., around $\pi_{new}=\pi$) gives
$$\begin{aligned} V^{\pi_{new}}- V^{\pi} & =V^{\pi} + \alpha\nabla_{\alpha}V^{\pi_{new}}|_{\alpha=0}+O(\alpha^2)-V^{\pi} \\ &\geq\alpha A_{\pi}(\pi') -O(\alpha^2) \end{aligned}$$
So far this only shows that the gain in the value function is bounded below; it is not yet in the target form. Continuing, Lemma 1 gives
$$V^{\pi_{new}}- V^{\pi} = \sum_sd^{\pi_{new}}(s)\sum_{a}{\pi_{new}(s,a)A^{\pi}(s,a)}$$
Now define a variable $\eta_t$: the number of steps up to time $t$ at which $\pi_{new}$ did not follow $\pi$. Then:
$$\Pr_{\tau\sim\pi_{new}}(S_t=s\mid\eta_t=0)=\Pr_{\tau\sim\pi}(S_t=s),\qquad \Pr(\eta_t=0)=(1-\alpha)^{t}$$
Some derivations use an exponent of $t-1$ instead; this makes no difference, since the final step sums an infinite geometric series and one extra factor does not matter. It follows that
$$p_t\equiv \Pr(\eta_t>0)=1-(1-\alpha)^{t}$$
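The identity $\Pr(\eta_t=0)=(1-\alpha)^{t}$ just says that each of the first $t$ steps independently follows $\pi$ with probability $1-\alpha$. A quick Monte Carlo check, with arbitrary illustrative values of $\alpha$ and $t$:

```python
import random

random.seed(0)
alpha, t, trials = 0.3, 5, 200_000

# eta_t counts the first t steps on which the mixture policy sampled
# from pi' (probability alpha per step, independently); we estimate
# the probability that it never did.
never_switched = sum(
    all(random.random() >= alpha for _ in range(t)) for _ in range(trials)
)
estimate = never_switched / trials
exact = (1 - alpha) ** t
print(round(estimate, 3), round(exact, 3))  # the two should be close
```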
And hence (here it comes):
$$\begin{aligned} V^{\pi_{new}}- V^{\pi} &= \sum_sd^{\pi_{new}}(s)\sum_{a}{\pi_{new}(s,a)A^{\pi}(s,a)}\\ &= \mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_t\gamma^{t}\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\right]\\ &= \sum_t(1-p_t)\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\,\middle|\,\eta_t=0\right]\\ &\quad+ \sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\,\middle|\,\eta_t>0\right]\\ &= \sum_t(1-p_t)\gamma^t\mathbb{E}_{\tau\sim\pi}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\right]\\ &\quad+ \sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\,\middle|\,\eta_t>0\right]\\ &=\alpha A_{\pi}(\pi')-\sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\right]\\ &\quad+ \sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\,\middle|\,\eta_t>0\right]\\ &\geq \alpha A_{\pi}(\pi')-2\alpha\sum_t\gamma^t\left[1-(1-\alpha)^{t}\right]\left[\max_s\sum_a\pi'(s,a)A^{\pi}(s,a)\right]\\ &= \alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma(1-\alpha)} \end{aligned}$$
This completes the proof.
Monotonic improvement
With Lemma 2 in hand, and since $1-\gamma(1-\alpha)\geq 1-\gamma$, it follows immediately that
$$V^{\pi_{new}}- V^{\pi}\geq\alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma}$$
Let $R>0$ be an upper bound on the rewards. Then
$$\left\{ \begin{aligned} & A_{\pi}(\pi') = \sum_sd^{\pi}(s)\sum_a\pi'(s,a)A^{\pi}(s,a) \leq \frac{R}{1-\gamma} \\ & \epsilon = \frac{1}{1-\gamma}\left[ \max_s\sum_a\pi'(s,a)A^{\pi}(s,a) \right]\leq \frac{R}{1-\gamma} \end{aligned} \right.$$
Now, choosing $\alpha=\frac{A_{\pi}(\pi')(1-\gamma)^2}{4R}$ (assuming $A_{\pi}(\pi')>0$) gives

$$\begin{aligned} V^{\pi_{new}}- V^{\pi} & \geq \alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma}\\ &= \frac{A_{\pi}^2(\pi')(1-\gamma)^2}{4R}-\frac{2\alpha^2\gamma\epsilon }{1-\gamma}\\ & \geq \frac{A_{\pi}^2(\pi')(1-\gamma)^2}{4R} - \frac{2\alpha^2\gamma R}{(1-\gamma)^2}\\ & = \frac{A_{\pi}^2(\pi')(1-\gamma)^2}{4R} - \frac{\gamma A_{\pi}^2(\pi')(1-\gamma)^2}{8R}\\ & \geq \frac{A_{\pi}^2(\pi')(1-\gamma)^2}{8R}>0 \end{aligned}$$

where the last step uses $\gamma\leq 1$.
This guarantees monotonic improvement of the policy. The original paper continues with many further derivations about algorithmic details, which I did not fully follow and will not reproduce here. In practice, the CPI technique itself is rarely used directly; rather, its idea is borrowed when designing algorithms.
OK, that concludes the theoretical derivation of CPI.
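To close, a toy CPI loop on a random tabular MDP illustrates the monotone improvement just proved. This is only a sketch, not the paper's full algorithm: it uses exact advantages instead of estimates, and a fixed small $\alpha$ rather than the theoretically prescribed one:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, alpha = 5, 3, 0.9, 0.1

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
mu = np.full(nS, 1.0 / nS)

def transition_matrix(pi):
    return np.einsum('sa,san->sn', pi, P)

def value(pi):
    return np.linalg.solve(np.eye(nS) - gamma * transition_matrix(pi),
                           (pi * r).sum(axis=1))

pi = np.full((nS, nA), 1.0 / nA)        # start from the uniform policy
returns = []
for _ in range(50):
    V = value(pi)
    returns.append(mu @ V)
    A = r + gamma * P @ V - V[:, None]  # exact advantage A^pi
    pi_prime = np.zeros_like(pi)        # greedy intermediate policy
    pi_prime[np.arange(nS), A.argmax(axis=1)] = 1.0
    pi = (1 - alpha) * pi + alpha * pi_prime  # conservative mixture step

diffs = np.diff(returns)
print(diffs.min() >= -1e-10)  # the value never decreases across iterations
```

With exact advantages and a greedy $\pi'$, every state satisfies $\sum_a\pi_{new}(a|s)A^{\pi}(s,a)=\alpha\max_a A^{\pi}(s,a)\geq 0$, so by Lemma 1 each step is a non-degradation; the lemmas above quantify how much improvement the chosen $\alpha$ guarantees.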