  • Abstract

    • ❌challenge1:large-scale real-world applications

      • How to effectively leverage large, previously collected datasets in RL?

    • ✅solution:offline RL

      • without further interaction

      • learn policy from previously-collected, static dataset

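    • challenge2: distribution shift --> overestimation

      distribution shift: the action distribution of the policy that collected the data can differ substantially from the action distribution of the policy being evaluated. When the evaluated policy produces an action that is not in the dataset, the Q-value of that action is estimated incorrectly, and if these errors keep accumulating they cause further problems.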

    • ✅solution:CQL

      1. theoretically: we can get

         • a conservative Q-function

         • the expected value of a policy under this Q-function lower-bounds its true value

      2. in practice: the standard Bellman error objective with a simple Q-value regularizer

      3. conclusion: on both discrete and continuous control domains, CQL often attains 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.

  • 1.Introduction

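    • (The introduction overlaps somewhat with the abstract: roughly, what the problem is and what method is proposed to fix it. This section introduces the concept of CQL. It may feel abstract and hazy at this point, but don't panic, keep reading.)

    • ❌RL's disadvantage --> offline RL's challenge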

    • ✅solution:CQL

      • 1. learn a conservative estimate of the value function, which provides a lower bound on the true values


      • 2. learn a less conservative lower-bound Q-function: only the expected value of the Q-function under the policy is lower-bounded, as opposed to a point-wise lower bound


        • step1: minimize values under an appropriately chosen distribution over state-action tuples,

        • step2: further tighten this bound by also incorporating a maximization term over the data distribution

    • work in paper (after reading a few papers, this part feels like boilerplate: it just lists what the paper actually did)

      • theoretical analysis……

      • empirically demonstrate……

        • CQL outperforms……

          • 2-5x on many benchmark tasks

          • outperform simple behavioral cloning on a number of realistic datasets collected from human interaction.

          • implemented with less than 20 lines of code

      • In experiments……

  • 2.Preliminaries

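    • (This part is background knowledge: before learning 1+1=2, you need to know what + and = mean.)

    • (If this part feels too abstract, just remember: 1. πβ(a|s) is the behavior policy that produced the dataset, so all of its actions are in the replay buffer; 2. µ is the policy we want to learn, which may produce OOD actions that we need to handle.) With these two points in mind you can skip the Preliminaries and read on, coming back for concepts and definitions whenever something is unclear.

    • 2.1 NOTES

      • 2.1.1 The goal in RL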
        • learn a policy that maximizes the expected cumulative discounted reward in a Markov decision process (MDP)

      • 2.1.2 (S, A, T, r, γ)
        • S, A represent state and action spaces

        • T(s'|s, a) and r(s, a) represent the transition dynamics and the reward function

        • γ ∈ (0, 1) represents the discount factor

      • 2.1.3 dataset D

        sampled from d^πβ(s) πβ(a|s).

        • πβ(a|s) represents the behavior policy (the behavior policy is a very important concept)

        • d^πβ(s) is the discounted marginal state distribution of πβ(a|s)

        • Mathematics in machine learning: the empirical distribution. The empirical behavior policy π̂β(a|s)

          the fraction of visits to state s in the historical data in which action a was chosen.

        • assume: |r(s, a)| ≤ Rmax

          the reward is assumed to be bounded
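The counting definition of the empirical behavior policy can be sketched in a few lines. This is a minimal illustration, assuming a small tabular dataset of hashable (state, action) pairs; the function name `empirical_behavior_policy` and the toy data are hypothetical and not from the paper.

```python
from collections import Counter


def empirical_behavior_policy(dataset):
    """Estimate pi_hat_beta(a|s) by counting (state, action) pairs.

    `dataset` is a list of (state, action) tuples with hashable entries.
    Returns a dict mapping state -> {action: estimated probability}.
    """
    pair_counts = Counter(dataset)
    state_counts = Counter(state for state, _ in dataset)
    policy = {}
    for (state, action), count in pair_counts.items():
        # Fraction of visits to `state` in which `action` was chosen.
        policy.setdefault(state, {})[action] = count / state_counts[state]
    return policy


# Hypothetical toy dataset: state "s0" is visited three times, "s1" once.
toy_dataset = [("s0", "left"), ("s0", "left"), ("s0", "right"), ("s1", "right")]
pi_hat = empirical_behavior_policy(toy_dataset)
print(pi_hat)
```

Actions that never occur in a state get no entry at all, which is exactly the situation in which an action is out-of-distribution for that state.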

    • 2.2 Off-policy RL algorithms

      • a parametric Q-function Qθ(s, a)

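        2.2.1 Q-learning methods train the Q-function via
        • Bellman optimality operator


        • use exact or an approximate maximization scheme, such as CEM to recover the greedy policy

          use the CEM maximization scheme to realize the greedy policy, i.e. choose the action with the largest Q-value

          • The basic idea of the cross-entropy method (CEM)

          • ϵ-greedy Policies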
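For reference, the Bellman optimality operator named above is usually written as follows. This is the standard textbook form in the notation of 2.1, added here because the formula itself is not reproduced in these notes.

```latex
\mathcal{B}^{*} Q(s, a) \;=\; r(s, a) + \gamma \, \mathbb{E}_{s' \sim T(s' \mid s, a)} \left[ \max_{a'} Q(s', a') \right]
```

Q-learning repeatedly regresses Qθ(s, a) onto this target, so the greedy policy is the one that picks the maximizing action a'.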

      • a parametric policy πφ(a|s)

        2.2.2 In actor-critic (AC) algorithms
        • policy evaluation


          • Bellman operator

            • Pπ is the transition matrix coupled with the policy

              transition matrix

              • Transition matrices and Markov chains
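The Bellman operator used for policy evaluation, and the role of the policy-coupled transition matrix Pπ, can be written out as below. This is the standard definition in the notation of 2.1, given as a reference form rather than a quotation.

```latex
\mathcal{B}^{\pi} Q \;=\; r + \gamma P^{\pi} Q,
\qquad
P^{\pi} Q(s, a) \;=\; \mathbb{E}_{s' \sim T(s' \mid s, a),\; a' \sim \pi(a' \mid s')} \left[ Q(s', a') \right]
```

In other words, Pπ first samples the next state from the dynamics and then the next action from the policy, which is why it is described as "coupled with the policy".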

      • 2.2.3 BUT!!!
        • empirical Bellman operator Bˆπ

          Since the dataset D does not contain every transition tuple (s, a, s'), the policy evaluation step actually uses the empirical Bellman operator, which backs up only a single sample

          • policy evaluation

          • policy improvement
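With the empirical Bellman operator, the two alternating steps take roughly the following form. This is written from my reading of the paper's preliminaries; treat the exact indexing of the policy iterate as an assumption.

```latex
\begin{align}
\hat{Q}^{k+1} &\leftarrow \arg\min_{Q} \; \mathbb{E}_{s, a, s' \sim \mathcal{D}} \Big[ \big( r(s, a) + \gamma \, \mathbb{E}_{a' \sim \hat{\pi}^{k}(a' \mid s')} [ \hat{Q}^{k}(s', a') ] - Q(s, a) \big)^{2} \Big] && \text{(policy evaluation)} \\
\hat{\pi}^{k+1} &\leftarrow \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi(a \mid s)} \big[ \hat{Q}^{k+1}(s, a) \big] && \text{(policy improvement)}
\end{align}
```

Note that a' in the evaluation target is sampled from the learned policy, while (s, a, s') come from the dataset. That mismatch is the source of the action distribution shift discussed in 2.3.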

    • 2.3 Offline RL algorithms

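      • (We now know the basic notation from the NOTES part and how the Q-function is updated in off-policy RL. Applied to the offline setting, yet another challenge appears. The authors then describe what earlier algorithms did to address the OOD problem and explain where they fall short.)
      • ❌challenge:action distribution shift

        A policy learned directly with evaluation & improvement may choose actions that never appear under the policy that generated the dataset (OOD actions), so no reliable Q-values exist for them. Because the policy is trained to maximize Q-values, it can be drawn towards out-of-distribution actions whose Q-values are overestimated, which makes the estimated Q-values too large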

        • the target values for Bellman backups in policy evaluation use actions sampled from the learned policy, πk

        • Q-function is trained only on actions sampled from the behavior policy that produced the dataset D, πβ.

      • ✅❌solution:constrain the learned policy away from OOD actions


        • Q-function training in offline RL does not suffer from state distribution shift, as the Bellman backup never queries the Q-function on out-of-distribution states.

        • However, the policy may suffer from state distribution shift at test time.
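Why maximizing over erroneous Q-values leads to overestimation can be seen in a toy simulation. The sketch below is an illustration only, not an experiment from the paper: it assumes every true Q-value is 0 and every estimate carries independent zero-mean Gaussian noise, and the function name is hypothetical.

```python
import random


def mean_max_of_noisy_q(num_actions, noise_std, num_trials, seed=0):
    """Average of max_a Q_hat(s, a) when every true Q(s, a) equals 0.

    Each estimate is the true value (0) plus zero-mean Gaussian noise, so
    any positive average is overestimation caused purely by the max.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / num_trials


# With a single action the estimate is unbiased; with ten candidate actions
# the maximum systematically lands on an action whose error is positive.
bias_one_action = mean_max_of_noisy_q(num_actions=1, noise_std=1.0, num_trials=5000)
bias_ten_actions = mean_max_of_noisy_q(num_actions=10, noise_std=1.0, num_trials=5000)
print(bias_one_action, bias_ten_actions)
```

In offline RL the errors at OOD actions are never corrected by new data, so a policy that maximizes the learned Q-function keeps exploiting them.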

  • 3.The Conservative Q-Learning Framework

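    Push down the Q-values of OOD (out-of-distribution) actions while moderately encouraging the Q-values of actions already in the buffer (training set), which prevents Q-values from being overestimated

    • 3.1 Conservative Off-Policy Evaluation

      First the conservative objective CQL(1) is proposed: it is indiscriminately conservative point by point, i.e. it pushes down (min) Q-values both inside and outside the dataset. Then CQL(2) is proposed, which adds a maximization term: Q-values in the dataset are pushed up, and only V (the expectation of Q) is constrained. Theorems 3.1 and 3.2 establish the guarantees for CQL(1) and CQL(2) respectively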

      • lower-bounds Qπ at all (s, a)

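        Q̂π := lim k→∞ Q̂k forms a lower bound on the true value at every state-action pair (s, a) (indiscriminately lowering all Q-values, inside & outside the dataset)

        • Our choice of penalty is to minimize the expected Q value under a particular distribution of state-action pairs, µ(s, a).

          µ(s, a) is the actor being trained: s comes from the replay buffer, and a is the new action the actor takes in that state

          • right of the plus sign

            • the Q-value is updated through the empirical Bellman operator and fitted with an MSE loss

          • left of the plus sign

            • the penalty: this Q-value needs to be pushed down, hence the minimization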
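The objective these two bullets describe is Equation 1 of the paper. It is written out below from my reading of the paper, in the notation of the Preliminaries: the first term (left of the plus sign) is the penalty, the second (right of the plus sign) is the usual Bellman error.

```latex
\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \;
\alpha \, \mathbb{E}_{s \sim \mathcal{D},\; a \sim \mu(a \mid s)} \left[ Q(s, a) \right]
\;+\; \frac{1}{2} \, \mathbb{E}_{s, a \sim \mathcal{D}} \left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s, a) \right)^{2} \right]
```

The trade-off factor α ≥ 0 controls how strongly Q-values under µ are pushed down.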

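      • the expected value of Q̂π under π(a|s) lower-bounds Vπ

        only the expected value needs to be a lower bound, i.e. only Vπ is constrained (Q-values of actions in the dataset are increased)


          • the red part

            • s and a both come from the replay buffer and should be encouraged (maximized)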
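For reference, Equation 2 of the paper adds a maximization term over the data distribution to the Equation 1 objective. As I read it, the "red part" is the subtracted expectation under π̂β; since the objective is minimized, subtracting it pushes those Q-values up.

```latex
\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \;
\alpha \cdot \left(
\mathbb{E}_{s \sim \mathcal{D},\; a \sim \mu(a \mid s)} \left[ Q(s, a) \right]
- \mathbb{E}_{s \sim \mathcal{D},\; a \sim \hat{\pi}_{\beta}(a \mid s)} \left[ Q(s, a) \right]
\right)
\;+\; \frac{1}{2} \, \mathbb{E}_{s, a, s' \sim \mathcal{D}} \left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s, a) \right)^{2} \right]
```

Compared with Equation 1, the only new piece is the second expectation inside the parentheses. The Bellman error term is unchanged.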

      • Theorem 3.1

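        • shows that the value function obtained from Equation (1) lower-bounds the true value function at every pair (s, a)

      • Theorem 3.2

        • when µ = π, the expected value of the Q-function lower-bounds the true value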
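Both theorems can be checked numerically on a tiny tabular MDP. The sketch below is a simplified illustration under stated assumptions: exact dynamics (no sampling error), µ = π, and the tabular closed-form minimizer of each iteration, which subtracts α·µ/πβ for Equation 1 and α·(µ/πβ − 1) for Equation 2 from the Bellman backup. The MDP, the two policies, and all constants are hypothetical.

```python
GAMMA = 0.9
ALPHA = 0.5
STATES, ACTIONS = range(2), range(2)

# Hypothetical 2-state, 2-action MDP: T[s][a][s2] and R[s][a].
T = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.5, 2.0]]
PI = [[0.9, 0.1], [0.2, 0.8]]       # evaluated policy pi(a|s), also used as mu
PI_BETA = [[0.5, 0.5], [0.6, 0.4]]  # behavior policy with full support


def value(q):
    """V(s) = E_{a ~ pi}[Q(s, a)]."""
    return [sum(PI[s][a] * q[s][a] for a in ACTIONS) for s in STATES]


def evaluate(penalty, iters=2000):
    """Iterate Q <- r + gamma * P^pi Q - penalty to numerical convergence."""
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(iters):
        v = value(q)
        q = [[R[s][a]
              + GAMMA * sum(T[s][a][s2] * v[s2] for s2 in STATES)
              - penalty[s][a]
              for a in ACTIONS] for s in STATES]
    return q


no_penalty = [[0.0 for _ in ACTIONS] for _ in STATES]
penalty_eq1 = [[ALPHA * PI[s][a] / PI_BETA[s][a] for a in ACTIONS] for s in STATES]
penalty_eq2 = [[ALPHA * (PI[s][a] / PI_BETA[s][a] - 1.0) for a in ACTIONS] for s in STATES]

q_true = evaluate(no_penalty)   # ordinary policy evaluation, Q^pi
q_eq1 = evaluate(penalty_eq1)   # point-wise conservative (Theorem 3.1)
q_eq2 = evaluate(penalty_eq2)   # conservative in expectation (Theorem 3.2)

v_true, v_eq1, v_eq2 = value(q_true), value(q_eq1), value(q_eq2)
print("V true:", v_true)
print("V eq1 :", v_eq1)
print("V eq2 :", v_eq2)
```

In this toy setting the Equation 1 estimate sits below Qπ at every (s, a), while the Equation 2 estimate gives a value that is still below Vπ but noticeably tighter.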

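    • 3.2 Conservative Q-Learning for Offline RL

      After the analysis above, the authors propose CQL(R), a CQL optimization algorithm for offline RL that adds a regularizer to CQL(2); different choices of the regularizer give CQL(H) and CQL(ρ). Theorem 3.3 proves that when α is large enough, the CQL RL algorithm obtains a lower bound on the true value. Theorem 3.4 proves that the Q-function is gap-expanding: the gap between the expected Q-values of in-dataset actions and of OOD actions is larger under the learned Q-function than under the true Q-function, because the in-dataset Q-values are pushed up while the OOD Q-values are pushed down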

      • CQL(R)

        • defining a family of optimization problems over µ(a|s)

          • If we choose R(µ) to be the KL-divergence against a prior distribution, ρ(a|s):


            • if ρ = Unif(a)


              • CQL(H)

            • if ρ(a|s) is chosen to be the previous policy π̂k−1(a|s)


              • CQL(ρ)

                • ⭐more stable with high-dimensional action spaces

          • the KL divergence between the current policy µ and a prior distribution ρ
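The family of objectives and its CQL(H) special case can be written as below. This is reconstructed from my reading of the paper (Equation 3 and the CQL(H) variant), so treat the exact constants as an assumption.

```latex
\begin{align}
\text{CQL}(\mathcal{R}): \quad
\min_{Q} \max_{\mu} \;& \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a \mid s)} [Q(s, a)] - \mathbb{E}_{s \sim \mathcal{D},\, a \sim \hat{\pi}_{\beta}(a \mid s)} [Q(s, a)] \right) \notag \\
& + \frac{1}{2} \, \mathbb{E}_{s, a, s' \sim \mathcal{D}} \left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^{k}(s, a) \right)^{2} \right] + \mathcal{R}(\mu) \\
\text{CQL}(\mathcal{H}): \quad
\min_{Q} \;& \alpha \, \mathbb{E}_{s \sim \mathcal{D}} \left[ \log \sum_{a} \exp(Q(s, a)) - \mathbb{E}_{a \sim \hat{\pi}_{\beta}(a \mid s)} [Q(s, a)] \right] \notag \\
& + \frac{1}{2} \, \mathbb{E}_{s, a, s' \sim \mathcal{D}} \left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^{k}(s, a) \right)^{2} \right]
\end{align}
```

With R(µ) = −D_KL(µ, ρ), the inner maximization has the closed-form solution µ(a|s) ∝ ρ(a|s)·exp(Q(s, a)). For a uniform ρ the first term therefore collapses into the log-sum-exp above, and µ no longer has to be represented explicitly.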

      • Theorem 3.3

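        • CQL learns a lower bound on the true Q-values

      • Theorem 3.4

        • CQL is gap-expanding: relative to the true Q-function, it widens the gap between the Q-values of in-distribution actions and those of OOD actions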
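Stated as an inequality, the gap-expanding property of Theorem 3.4 reads as follows. This is my reading of the theorem: Q̂k is the learned Q-function, Qk the true one, and µk the distribution whose Q-values are minimized at iteration k.

```latex
\forall s \in \mathcal{D}: \quad
\mathbb{E}_{\pi_{\beta}(a \mid s)} \big[ \hat{Q}^{k}(s, a) \big] - \mathbb{E}_{\mu_{k}(a \mid s)} \big[ \hat{Q}^{k}(s, a) \big]
\;>\;
\mathbb{E}_{\pi_{\beta}(a \mid s)} \big[ Q^{k}(s, a) \big] - \mathbb{E}_{\mu_{k}(a \mid s)} \big[ Q^{k}(s, a) \big]
\quad \text{for sufficiently large } \alpha_{k}
```

So the comparison is between two gaps (in-distribution minus OOD), one measured with the learned Q-function and one with the true Q-function.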

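    • 3.3 Safe Policy Improvement Guarantees

      Safe policy improvement (SPI) is one approach to safe reinforcement learning. Put simply, "safe" here means that the updated policy performs no worse than the original policy.

      • Theorem 3.5

        • CQL optimizes a penalized empirical objective: it not only maximizes the return in the empirical MDP but also keeps the learned policy close to the behavior policy that generated the dataset (introduced implicitly by the gap-expanding property)

      • Theorem 3.6

        • the policy derived by CQL is a ζ-safe policy improvement over the behavior policy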
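In symbols, the two statements look roughly like this. The forms below follow my reading of Theorems 3.5 and 3.6; M̂ denotes the empirical MDP built from the dataset, M the actual MDP, and J the expected discounted return.

```latex
\begin{align}
\pi^{*} &\leftarrow \arg\max_{\pi} \; J(\pi, \hat{M}) - \alpha \, \frac{1}{1 - \gamma} \, \mathbb{E}_{s \sim d^{\pi}_{\hat{M}}(s)} \left[ D_{\text{CQL}}(\pi, \hat{\pi}_{\beta})(s) \right],
\qquad
D_{\text{CQL}}(\pi, \pi_{\beta})(s) := \sum_{a} \pi(a \mid s) \left( \frac{\pi(a \mid s)}{\pi_{\beta}(a \mid s)} - 1 \right) \\
J(\pi^{*}, M) &\geq J(\pi_{\beta}, M) - \zeta \quad \text{with high probability} \quad (\zeta\text{-safe policy improvement})
\end{align}
```

The penalty D_CQL is zero when π = πβ and grows as π puts more mass on actions that πβ rarely takes, which is how the policy is kept close to the behavior policy.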

  • 4.Practical Algorithm and Implementation Details

    • overall algorithm

    • Implementation details

      • continuous control experiments

        • 20 lines of code on top of soft actor-critic (SAC)

        • learning rate for the policy is chosen to be 3e-5 (vs 3e-4 or 1e-4 for the Q-function)

        • α via Lagrangian dual gradient descent

      • discrete control experiments

        • 20 lines of code on top of QR-DQN

        • α fixed at constant values
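To make the "20 lines of code" claim concrete, here is a minimal sketch of the CQL(H) loss for discrete actions in plain Python. It assumes Q-values are given as per-state lists and that the Bellman targets were computed elsewhere; the function names are hypothetical, and a real implementation would sit inside the Q-network update of SAC or QR-DQN.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_h_loss(q_values, actions, td_targets, alpha):
    """CQL(H) loss for a batch of dataset transitions with discrete actions.

    q_values[i]   : list of Q(s_i, a) for every action a
    actions[i]    : index of the action stored in the dataset for s_i
    td_targets[i] : Bellman target for (s_i, a_i), assumed precomputed
    """
    n = len(actions)
    regularizer = 0.0
    bellman_error = 0.0
    for q_s, a, target in zip(q_values, actions, td_targets):
        # Push down a soft maximum over all actions, push up the dataset action.
        regularizer += logsumexp(q_s) - q_s[a]
        bellman_error += 0.5 * (q_s[a] - target) ** 2
    return alpha * regularizer / n + bellman_error / n


# Same dataset action and target, but the unseen action has a higher Q-value
# in the second case, so the regularizer (and the loss) is larger.
loss_flat = cql_h_loss([[0.0, 0.0]], actions=[0], td_targets=[0.0], alpha=1.0)
loss_ood_high = cql_h_loss([[0.0, 5.0]], actions=[0], td_targets=[0.0], alpha=1.0)
print(loss_flat, loss_ood_high)
```

The dataset action stands in for a sample from π̂β, so the behavior policy never has to be estimated explicitly.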

  • 5.Related Work

    • Off-policy evaluation (OPE)


      • Earlier works

        • used per-action importance sampling on Monte-Carlo returns to obtain an OPE return estimator

      • Recent approaches

        • use marginalized importance sampling by directly estimating the state-distribution importance ratios via some form of dynamic programming and typically exhibit less variance than per-action importance sampling at the cost of bias.
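The earlier style of estimator can be sketched as ordinary importance sampling over whole trajectories, with the weight built from per-action probability ratios. This is a generic textbook form under the assumption of tabular policies stored as dictionaries; the function name and the toy bandit data are hypothetical.

```python
def importance_sampling_ope(trajectories, pi, pi_beta, gamma):
    """Estimate the return of `pi` from trajectories collected by `pi_beta`.

    trajectories : list of trajectories, each a list of (state, action, reward)
    pi, pi_beta  : dicts mapping (state, action) -> probability
    """
    total = 0.0
    for trajectory in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward in trajectory:
            # Product of per-action ratios pi(a|s) / pi_beta(a|s).
            weight *= pi[(state, action)] / pi_beta[(state, action)]
            ret += discount * reward
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)


# One-step toy problem: the behavior policy is uniform, the target policy
# always takes action 1, and only action 1 is rewarded.
pi_beta = {("s", 0): 0.5, ("s", 1): 0.5}
pi = {("s", 0): 0.0, ("s", 1): 1.0}
data = [[("s", 0, 0.0)], [("s", 1, 1.0)]]
estimate = importance_sampling_ope(data, pi, pi_beta, gamma=0.99)
print(estimate)
```

Because the weight is a product over time steps, its variance grows quickly with the horizon, which is the motivation for the marginalized estimators mentioned above.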

    • Offline RL

    • Theoretical results.

      • the regularizer in CQL explicitly addresses the impact of OOD actions due to its gap-expanding behavior

      • CQL does not require estimating the behavior policy

      • CQL does not underestimate Q-values for all state-action tuples.

      • more robust to estimation error in online RL.

  • 6.Experimental Evaluation

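    • Datasets

      • D4RL

        D4RL is an open-source benchmark for offline reinforcement learning that provides standardized environments and datasets for training and benchmarking algorithms. The data-collection policies include

        • 7 sub-environments

          In offline RL, training uses the D4RL datasets, while testing interacts with the actual RL environment, e.g. MuJoCo. Each sub-environment has 5 smaller environments

      • MuJoCo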


    • Gym domains

    • Adroit tasks

    • AntMaze

    • Kitchen tasks

    • Offline RL on Atari games

    • Analysis of CQL

      • only CQL gives negative values, i.e. it is more conservative (the numbers in the table are Qestimate − Qtrue)


  • 7.Discussion

    • First, we proposed conservative Q-learning (CQL), an algorithmic framework for offline RL that learns a lower bound on the policy value.

    • Second, we demonstrate that CQL outperforms prior offline RL methods by 2-5x on a wide range of offline RL benchmark tasks

    • challenge

      • rigorous theoretical analysis of CQL with deep neural nets?

      • devise simple and effective early stopping methods?

  • Appendices

    • A-Discussion of CQL Variants


      • CQL(H)
        • R = H(µ)

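      • Two derivation figures were pasted here. My brain was about to take off at the time, so the handwriting is rather... explosive. Hope they are readable 😁
      • CQL(ρ)
        • R = KL divergence

      • CQL(var)
        • trains the Q-function with a penalty on the variance of the Q-function predictions under the distribution P̂.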

    • B-Discussion of Gap-Expanding Behavior of CQL Backups

      experiments on CQL's gap-expanding behavior

      • Function approximation may give rise to erroneous Q-values at OOD actions.

        Lower-bound argument with Q-function approximation. Function approximation: in practice, for large-scale problems where both the state and action spaces are large, obtaining the value functions v(s) and q(s, a) exactly is almost impossible. An approximate function is needed instead, for example a linear combination of features, a neural network, or another approximator.

        • prior work

          uses policy-constraint techniques

          • ❌"generalization" or the coupling effects of the function approximator may be heavily influenced by the properties of the data distribution

          • ❌higher values at OOD actions than the Q-values at in-distribution actions

          • ❌the erroneous Q-function may push the policy towards OOD actions

      • How can CQL address this problem?

        • CQL

          the CQL regularizer addresses the problems above

          • ✅maximizes Q-values under the dataset distribution, and minimizes them otherwise.

          • ✅by controlling αk, CQL can push down the learned Q-value at out-of-distribution actions as much as desired, correcting for the erroneous overestimation error in the process.

      • Empirical evidence on high-dimensional benchmarks with neural networks.

        • CQL backups are gap expanding in practice,

        • Policy constraint methods, which do not impose any regularization on the Q-function, may observe highly positive ∆̂k values during training,

        • When ∆̂k values continuously grow during training, the policy might eventually suffer from an unlearning effect

    • C-Theorem Proofs

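      proofs of Theorems 3.1 - 3.4 from the main text

      • Proof of Theorem 3.1

      • the teacher walked us through the derivation 👇

      • Proof of Theorem 3.2

      • Key point!!!! I derived this one myself. One symbol turned out to be wrong when I reviewed it after class, but I can actually derive formulas now, I'm crying 😭😭😭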

      • Proof of Theorem 3.3

      • Proof of Theorem 3.4

    • D-Additional Theoretical Analysis

      • D.1 CQL with Linear and Non-Linear Function Approximation

      • D.2 Choice of Distribution to Maximize Expected Q-Value in Equation 2

        In Equation 2, arg min α [E_µ Q − E_πβ Q]: what happens if a different distribution is used in place of πβ?

      • D.3 CQL with Empirical Dataset Distributions

        the sample-based version of Equation 2

      • D.4 Safe Policy Improvement Guarantee for CQL


    • E-Extended Related Work and Connections to Prior Methods

      • Relationship to uncertainty estimation in offline RL

        • ❌uncertainty-based methods are not sufficient to protect against OOD actions

      • How does CQL relate to prior uncertainty estimation methods?

        • CQL strengthens the popular practice of point-wise lower-bounds made by uncertainty estimation methods

      • Can we make CQL dependent on uncertainty?

    • F-Additional Experimental Setup and Implementation Details

      • Choice of α

      • Computing log Σ_a exp(Q(s, a))

      • Hyperparameters.

        • Q-function learning rate

        • Policy learning rate

        • Lagrange threshold τ

        • Number of gradient steps

        • Choice of Backup
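The automatic tuning of α with a Lagrange threshold τ can be sketched as a projected gradient step on the dual variable. This is a simplified illustration that assumes the constraint has the form "CQL gap ≤ τ", where the gap is the regularizer value (for example log-sum-exp of Q minus the dataset Q-value, averaged over a batch); the function name and the step size are hypothetical.

```python
def update_alpha(alpha, cql_gap, tau, step_size):
    """One dual gradient ascent step on alpha, projected onto alpha >= 0.

    alpha grows while the measured gap exceeds the threshold tau and
    shrinks (down to zero) once the gap falls below it.
    """
    return max(0.0, alpha + step_size * (cql_gap - tau))


# Gap above the threshold: the penalty weight is increased.
alpha_up = update_alpha(alpha=1.0, cql_gap=7.0, tau=5.0, step_size=0.1)
# Gap far below the threshold: the weight is reduced and clipped at zero.
alpha_down = update_alpha(alpha=0.05, cql_gap=1.0, tau=5.0, step_size=0.1)
print(alpha_up, alpha_down)
```

Practical implementations often optimize log α with a gradient-based optimizer instead of this explicit projection, which keeps α positive without clipping.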

    • G-Ablation Studies

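    • In summary, my understanding of CQL is that the added regularizer pushes up the Q-values of actions produced by the behavior policy πβ and pushes down the Q-values of OOD actions produced by the learned policy µ. Since the agent always picks the action with the largest cumulative reward, it is steered towards the in-distribution actions whose Q-values we deliberately raised. CQL is also gap-expanding: with one set of Q-values raised and the other lowered, the gap between them keeps growing. Brilliant 🤖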
