[Paper Notes] Skills Regularized Task Decomposition for Multi-task Offline Reinforcement Learning

Open-source code for this paper: Skills Regularized Task Decomposition for Multi-task Offline Reinforcement Learning


  • [Paper Notes] Skills Regularized Task Decomposition for Multi-task Offline Reinforcement Learning
    • Abstract
    • 1 Introduction
    • 2 Overall Approach
      • 2.1 Preliminary
        • Offline RL
        • Multi-task RL
        • Hidden Parameter MDP
      • 2.2 Overall Approach For Multi-task Offline RL
    • 3 Task Decomposition with Quality-aware Skill Regularization
      • 3.1 Learning Skill Embeddings
      • 3.2 Skill-regularized Task Decomposition
    • 4 Data Augmentation by Imaginary Demonstrations
    • 5 Experiments
        • Experiment settings
        • Comparison methods
        • Offline datasets
      • 5.1 Meta-world Tests
        • Performance on MT10 benchmark
        • Ablation study
      • 5.2 A Case Study for Airsim-based Drone Navigation
    • 6 Related Work
        • Multi-task RL
        • Task and skill embeddings in multi-task RL
        • Data augmentation in offline RL
    • 7 Conclusion


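  1. Background: Reinforcement learning (RL) with diverse offline datasets can exploit the relations among multiple tasks and the common skills learned across them, which lets us handle complex real-world problems efficiently in a data-driven way.

  2. Problem: Offline RL uses only previously collected data, and online interaction with the environment is restricted, so it is hard to obtain optimal policies for multiple tasks, especially when data quality differs across tasks.

  3. Approach: a skill-based multi-task RL technique applied to heterogeneous datasets generated by behavior policies of different quality.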

    In this paper, we present a skill-based multi-task RL technique on heterogeneous datasets that are generated by behavior policies of different quality.

  4. Technical route:

    • Task decomposition: common skills are learned jointly across the datasets and then guide the reformulation of each task into shared, achievable subtasks.

      To learn the shareable knowledge across those datasets effectively, we employ a task decomposition method for which common skills are jointly learned and used as guidance to reformulate a task in shared and achievable subtasks.

    • Shared latent space: a Wasserstein auto-encoder (WAE) embeds both skills and tasks in one latent space, and a quality-weighted loss acts as a regularizer so that tasks decompose into subtasks aligned with high-quality skills.

      In this joint learning, we use Wasserstein auto-encoder (WAE) to represent both skills and tasks on the same latent space and use the quality-weighted loss as a regularization term to induce tasks to be decomposed into subtasks that are more consistent with high-quality skills than others.

    • Data augmentation: each task's dataset is extended with imaginary trajectories tied to high-quality skills, which improves the offline RL agent trained on the latent space.

      To improve the performance of offline RL agents learned on the latent space, we also augment datasets with imaginary trajectories relevant to high-quality skills for each task.

  5. Experimental setup: several robotic manipulation tasks and drone navigation tasks

  6. Results: the proposed multi-task offline RL method is robust to mixed configurations of datasets of different quality, and it outperforms other state-of-the-art algorithms.

1 Introduction


Recently, a data sharing method for multi-task learning was introduced to address the issue of limited data for real-world control applications. Yet, multi-task RL has not been fully investigated in offline settings.


In the offline RL context, we present a novel multi-task model by which a single policy for multiple tasks can be data-efficiently achieved and its learning procedure is robust to heterogeneous datasets of different quality.



In offline RL where interaction with the environment is not allowed and arbitrary or low-performance behavior policies might be involved in data collection, it is important to maintain the robustness in learning on different-quality data.

To this end, we devise a joint learning mechanism of skill (short-term action sequences from the datasets) and task representation, which enables the task decomposition into achievable subtasks via quality-aware skill regularization. The model ensures the robustness of learned policies upon the mixed configurations of different-quality datasets.


We also employ data augmentation based on high-quality skills, creating plausible imaginary trajectories that are likely to be generated by expert policies and thereby alleviating the limited quality and scale of offline datasets.


2 Overall Approach

2.1 Preliminary

Offline RL

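Offline RL aims to maximize the cumulative discounted reward $J(\pi)$, using the same formulation as conventional RL. Unlike conventional RL, however, it assumes that training uses only a static dataset of previously collected trajectories, $D = \{(s_t, a_t, r_t, s_{t+1})\}$, with little or no interaction with the environment.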

Offline RL algorithms can increase the usability of previously collected data in the domain of making sequential decisions where temporal credit assignment with long time horizons is important.
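As a small illustration of the objective, the discounted return of one logged episode can be computed directly from a static dataset. This is only a sketch: the tuple layout mirrors the dataset $D$ of $(s_t, a_t, r_t, s_{t+1})$ transitions, while `discounted_return`, the toy `dataset`, and the value of `gamma` are illustrative choices that do not come from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t, accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A static offline dataset: one short episode of (s_t, a_t, r_t, s_{t+1}) tuples.
dataset = [
    (0.0, 1.0, 1.0, 1.0),
    (1.0, 1.0, 1.0, 2.0),
    (2.0, 1.0, 1.0, 3.0),
]

rewards = [r for (_, _, r, _) in dataset]
episode_return = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25
```

No environment call appears anywhere: everything is computed from logged tuples, which is the defining restriction of the offline setting.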


Multi-task RL

Multi-task RL considers more than a single task when achieving the optimal policy $\pi^{\ast}$. It is normally formulated as a family of MDPs $\mathbf{T}_{i}=(s^{i}, a^{i}, r^{i}, s_{next}^{i})_{i}$, where each individual task $\mathbf{T}_{i}$ is associated with its respective MDP and is sampled according to a task distribution $p(\mathbf{T})$.

Hidden Parameter MDP

To represent the implicit temporal dynamics tied to the Markov property of each task in a multi-task environment, a hidden latent variable $v_{t}$ is introduced:

$$
\begin{aligned}
R(s_t, v_t, a_t) &:= R_{v_t}(s_t, a_t) \\
P(s_{t+1}, v_{t+1} \mid a_t, s_t, v_t) &:= P_{v_t}(s_{t+1} \mid a_t, s_t)
\end{aligned}
$$

The actual state space is extended to $S\times V$, where $V$ is the set of latent variables $v_{t}$.

The result is a partially observable MDP (POMDP), specified by the tuple $(S\times V, A, \Omega, P_{V}, R_{V}, O, \gamma)$,

where $\Omega = S$ and $O(s_t,v_t)\rightarrow s_t$ denote the observation space and the observation function.

Comments: The authors introduce the latent variable to capture the implicit temporal dynamics tied to the MDP's properties, an argument that also applies to meta-RL. The assumption that all tasks follow one task distribution resembles meta-RL as well.
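A toy example may make the hidden-parameter view concrete. The sketch below assumes a 1-D state, a hidden parameter `v` that stays fixed within an episode, and two hypothetical goal positions; none of these details come from the paper. The point is only that the observation function drops `v`, so two tasks can look identical to the agent while rewarding the same action differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullState:
    s: float  # observable part of the state
    v: int    # hidden parameter (here: which task is active)


GOALS = {0: 1.0, 1: -1.0}  # hypothetical per-task goal positions


def reward(state, a):
    """R(s, v, a) := R_v(s, a): negative distance to the goal selected by v."""
    return -abs(state.s + a - GOALS[state.v])


def step(state, a):
    """P_v(s' | a, s): deterministic here, and v is assumed constant."""
    return FullState(s=state.s + a, v=state.v)


def observe(state):
    """O(s, v) -> s: the agent never sees v directly."""
    return state.s


task_a = FullState(s=0.0, v=0)
task_b = FullState(s=0.0, v=1)
same_observation = observe(task_a) == observe(task_b)
rewards_for_same_action = (reward(task_a, 1.0), reward(task_b, 1.0))
```

Because the observations coincide, the agent has to infer `v` from a window of past transitions, which is the role the task encoder plays later in these notes.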

2.2 Overall Approach For Multi-task Offline RL


$$\arg\max_{\pi}\ J_{\mathbf{D}}(\pi)-\alpha\cdot c(\pi,\pi_{\mathbf{D}})$$

  • $J_{\mathbf{D}}(\pi)$ is the average cumulative reward of policy $\pi$ under the given dataset $\mathbf{D}$, which was generated by the behavior policy $\pi_{\mathbf{D}}$.
  • $c(\cdot)$ is a regularization term that penalizes the discrepancy between $\pi$ and $\pi_{\mathbf{D}}$, which keeps $\pi$ from converging to some strange "point".
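The objective above can be estimated on a batch as follows. This is a minimal sketch that assumes the discrepancy $c$ is a mean squared difference between the policy's actions and the logged actions (a behavior-cloning style penalty, similar in spirit to TD3+BC); the paper's objective leaves $c$ generic, and `regularized_objective` is a hypothetical helper name.

```python
import numpy as np


def regularized_objective(q_values, policy_actions, dataset_actions, alpha):
    """Batch estimate of J_D(pi) - alpha * c(pi, pi_D)."""
    j_hat = np.mean(q_values)  # critic's estimate of the return of pi
    # Assumed form of c: squared distance between pi's actions and logged actions.
    c_hat = np.mean(np.sum((policy_actions - dataset_actions) ** 2, axis=-1))
    return j_hat - alpha * c_hat


q_values = np.array([1.0, 2.0])
policy_actions = np.array([[0.0], [0.0]])
dataset_actions = np.array([[1.0], [1.0]])

value = regularized_objective(q_values, policy_actions, dataset_actions, alpha=0.5)
stricter = regularized_objective(q_values, policy_actions, dataset_actions, alpha=2.0)
```

A larger `alpha` ties the policy more tightly to the behavior policy, which is exactly why the objective suffers when the logged actions are poor.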

With this regularization by the behavior policy, offline RL algorithms are often vulnerable to low-quality datasets. Overfitting problems can occur such that the maximum average return $\max_{\pi} J_{\mathbf{D}}(\pi)$ is much lower than that of its respective true MDP $M$, when a low-performance or arbitrary policy is used for data generation.

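The authors point out the following weakness: because of this regularization term, the learned policy is forced to stay close to the policy that generated the data, so offline RL becomes fragile when the datasets come from behavior policies that differ widely in quality. When a low-performance or arbitrary policy is used for data generation, overfitting can occur, and the maximum average return $\max_{\pi} J_{\mathbf{D}}(\pi)$ of $\hat{M}$ (the MDP implied by the dataset $\mathbf{D}$) ends up far below that of the corresponding true MDP $M$.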

In multi-task offline RL, we reformulate a family of MDPs $\{T_{i}\}_{i}$ as a hidden parameter MDP in that multiple MDPs are combined into a single POMDP based on hidden parameters that specify temporal Markovian properties of the environment.

While the overfitting issue of offline RL can be alleviated by exploring the relation of multiple tasks and inducing the shareable knowledge from their datasets in a multi-task setting, it is not guaranteed that inferring the hidden parameters fully enables the well-structured representation of related tasks.

It is because the behavior policy heterogeneity and state-action pair disparity of tasks can prevent the sub-trajectories of common-knowledge tasks from being closely mapped on the latent space.


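  1. In multi-task offline RL, the authors reformulate the family of MDPs $\{T_{i}\}_{i}$ as a single hidden parameter MDP. This is possible because multiple MDPs can be combined into one POMDP whose hidden parameters specify the temporal Markovian properties of the environment.
  2. The overfitting problem of offline RL can be alleviated by exploring the relations among tasks and extracting shareable knowledge from their datasets, but inferring the hidden parameters does not by itself guarantee a well-structured representation of related tasks.
  3. The reason is that behavior policies are heterogeneous and state-action pairs differ across tasks, so sub-trajectories that carry common knowledge may not be mapped close together on the shared latent space.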

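  • In the figure, blue circles denote task embeddings and green circles denote skill embeddings.

    Mechanism: the task embedding yields $z_{1}$ and the skill embedding yields $b_{1}$; $z_{1}$ is then shifted to $z_{1}^{\prime}$ so that it lies closer to $b_{1}$.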

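  • Part (a):

    Sub-trajectories from static datasets are converted into skill embeddings and task embeddings on the same latent space, which together enable the decomposition of tasks into achievable subtasks.

    Each sub-trajectory is thus encoded twice, once as a skill embedding and once as a task embedding, and because both live on the same latent space they can jointly drive the decomposition of tasks into achievable subtasks.

    The action sequence of the sub-trajectory $\tau_{1}$ with large returns.

    This is how the authors define a "good" (high-quality) skill embedding $b_{1}$: the action sequence of a sub-trajectory $\tau_{1}$ that has a large return.


    By jointly learning common skills and adapting subtasks to them through quality-aware skill regularization, a single task can be decomposed and re-expressed in a latent space of achievable subtasks, which gives a more feasible representation.

  • Part (b):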

    In (b), for training offline RL agents, imaginary trajectories similar to expert demonstrations are sampled from the latent space and added to the datasets.



3 Task Decomposition with Quality-aware Skill Regularization


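On the right side of the figure, the red arrows denote $L_{PR}$, which stretches low-quality sub-trajectories within the task prior distribution (from dark pink to light pink), and the blue arrows denote $L_{SR}$, which contracts high-quality sub-trajectories around the skill distribution (from light blue to dark blue).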

3.1 Learning Skill Embeddings

To represent the agent's behavior as a vector in a latent space $\mathbf{Z}$, the authors use an auto-encoding mechanism.

Since an action sequence over a short horizon captures the agent's behavior on a specific task, the latent vector $b_t$ is called a skill embedding.

The encoder $q_{\phi}$ takes a sequence of state-action pairs $d_{t}=(s,a)_{t-n:t+n-1}$ as input and maps it to a latent vector $b_{t} \in\mathbf{Z}$, while the decoder $p_{\phi}$ reconstructs the input action sequence $a_{t-n:t+n-1}$ from the combination of $b_{t}$ and $s_{t-n:t+n-1}$.
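The indexing of the skill windows is easy to get wrong, so here is a sketch of how the encoder input and the decoder's input/target pairs could be cut from one logged episode. It assumes episodes are stored as NumPy arrays with one row per time step; `skill_windows` is a hypothetical helper, and the window covers indices $t-n$ through $t+n-1$ inclusive, i.e. $2n$ steps.

```python
import numpy as np


def skill_windows(states, actions, n):
    """Return (encoder_input, decoder_states, target_actions) for every valid t."""
    T = len(states)
    out = []
    for t in range(n, T - n + 1):
        sl = slice(t - n, t + n)  # indices t-n, ..., t+n-1 (2n steps)
        d_t = np.concatenate([states[sl], actions[sl]], axis=-1)  # encoder input
        out.append((d_t, states[sl], actions[sl]))
    return out


states = np.arange(12, dtype=float).reshape(6, 2)  # T = 6 steps, 2-D states
actions = np.ones((6, 1))                          # 1-D actions
windows = skill_windows(states, actions, n=2)
```

The encoder would consume `d_t`, while the decoder would receive the encoded $b_t$ together with the window's states and be trained to reproduce the window's actions.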

For maintaining the learning stability on skill embeddings $b_{t} \in\mathbf{Z}$, we use Wasserstein auto-encoder (WAE) with the maximum mean discrepancy (MMD)-based penalty and a prior distribution on $b_t$.

In other words, the skill embeddings are trained as a WAE whose MMD penalty pulls the encoded $b_t$ toward a chosen prior, which keeps training stable.

$\{\hat{b}_{i}\}_{i=1}^{m}\sim P_{B}$ are samples drawn from the prior over skill embeddings, and $\lambda > 0$ is the regularization hyperparameter that weights the prior-based penalty.

$L_{PR}$ constrains the skill embeddings.

$m$ is the number of samples in each of $\{b,\hat{b}\}$, and $k:\mathbf{Z}\times\mathbf{Z}\rightarrow\mathbf{R}$ is a positive-definite kernel.
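For reference, an MMD penalty between $m$ encoded embeddings and $m$ prior samples can be estimated as below. This is a sketch of the standard WAE-MMD estimator, not the authors' code: the RBF kernel and its `bandwidth` are assumptions (any positive-definite kernel $k$ fits the description), and `mmd_penalty` is an illustrative name.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Positive-definite RBF kernel evaluated between all rows of x and y."""
    sq_dist = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_penalty(b, b_hat, bandwidth=1.0):
    """MMD estimate between encoded embeddings b and prior samples b_hat (m each)."""
    m = b.shape[0]
    off_diag = 1.0 - np.eye(m)  # drop the i == j terms within each sample set
    k_bb = (rbf_kernel(b, b, bandwidth) * off_diag).sum() / (m * (m - 1))
    k_hh = (rbf_kernel(b_hat, b_hat, bandwidth) * off_diag).sum() / (m * (m - 1))
    k_bh = rbf_kernel(b, b_hat, bandwidth).mean()
    return k_bb + k_hh - 2.0 * k_bh


rng = np.random.default_rng(0)
prior = rng.normal(size=(200, 2))       # samples b_hat ~ P_B (assumed Gaussian)
matched = rng.normal(size=(200, 2))     # embeddings that already follow the prior
shifted = matched + 3.0                 # embeddings far away from the prior

mmd_matched = mmd_penalty(matched, prior)
mmd_shifted = mmd_penalty(shifted, prior)
```

The penalty stays near zero when the embeddings already follow the prior and grows when they drift away, which is what makes it usable as the $\lambda$-weighted regularizer.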

3.2 Skill-regularized Task Decomposition

  1. Each task is viewed as a composition of subtasks, and these subtasks can be modeled as a hidden parameter MDP.

    We first view each task as a composition of subtasks which can be modeled as a hidden parameter MDP.

  2. Task embeddings use a WAE-based model architecture, mirroring the skill embeddings described above.

    For task embeddings, we then use the WAE-based model architecture similar to skill embeddings previously described.

  3. For sub-trajectories $\tau_{t} = (s_{t-n:t}, a_{t-n-1:t-1}, r_{t-n-1:t-1})$ of $n$-length transitions:

    an encoder $q_{\theta}:\tau_{t}\rightarrow z_t \in \mathbf{Z}$ produces task embeddings in $\mathbf{Z}$;

    a decoder $p_{\theta} : (s_t, a_t, z_t) \rightarrow (s_{t+1}, r_t)$ expresses the transition probability $P$ and the reward function $R$.

    For sub-trajectories $\tau_{t} = (s_{t-n:t}, a_{t-n-1:t-1}, r_{t-n-1:t-1})$ of $n$-length transitions each, we have an encoder $q_{\theta}:\tau_{t}\rightarrow z_t \in \mathbf{Z}$ to yield task embeddings and a decoder $p_{\theta} : (s_t, a_t, z_t) \rightarrow (s_{t+1}, r_t)$ to express the transition probability $P$ and reward function $R$.

    The training objective of the task embedding therefore resembles a model-based method: the decoder has to predict the next state and the reward.

    For the task embedding, the states, actions, and rewards it sees are all affected by data quality, so a regularizer is needed here, namely one based on the cumulative reward collected over this segment of transitions. So is the authors' "quality-aware" part essentially just this reward-based regularization?
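The model-based flavor of the task-embedding objective can be sketched as a reconstruction loss on the decoder $p_\theta$. Everything concrete here is an assumption made for illustration: a squared-error loss, a hypothetical linear decoder `linear_decoder`, and the `params` layout. Only the input/output signature $(s_t, a_t, z_t) \rightarrow (s_{t+1}, r_t)$ follows the notes.

```python
import numpy as np


def linear_decoder(params, s, a, z):
    """Hypothetical p_theta: predict (s_next, r) from (s, a, z) with one linear map."""
    x = np.concatenate([s, a, z], axis=-1)
    out = x @ params["W"] + params["b"]
    return out[..., :-1], out[..., -1]  # all but last column: s_next, last: r


def reconstruction_loss(params, s, a, z, s_next, r):
    """Assumed squared-error loss on next-state and reward predictions."""
    s_pred, r_pred = linear_decoder(params, s, a, z)
    state_err = np.sum((s_pred - s_next) ** 2, axis=-1)
    reward_err = (r_pred - r) ** 2
    return float(np.mean(state_err + reward_err))


batch, ds, da, dz = 4, 2, 1, 3
params = {"W": np.zeros((ds + da + dz, ds + 1)), "b": np.zeros(ds + 1)}
s, a, z = np.ones((batch, ds)), np.ones((batch, da)), np.ones((batch, dz))
s_next, r = np.zeros((batch, ds)), np.ones(batch)

loss = reconstruction_loss(params, s, a, z, s_next, r)
```

In a full model the embedding `z` would come from the encoder $q_\theta$ applied to the sub-trajectory, so gradients of this loss would shape the task embedding as well as the decoder.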


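This lets the encoder $q_{\theta}$ produce subtask-level embeddings (subtask embeddings) from a series of sub-trajectories in the multi-task setting. In particular, each task is represented as closely related to the high-quality skills learned from trajectories with large episodic returns. By relying more on high-quality skills in a task-agnostic way, this task decomposition reduces the adverse effect of low-quality data and breaks tasks into more achievable subtasks.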




Next, the authors analyze the effect of skill regularization.

Let $q$ and $p$ be the skill encoder and decoder obtained by minimizing the loss $L_{SE}$. As in prior work, $p$ is treated as part of the environment: the decoder $p$ induces an MDP $M_p =(S, A = \mathbf{Z}, P_p, R_p, \gamma)$, in which a high-level (skill) action $z_t \in \mathbf{Z}$ is converted into a low-level (primitive) action $a_t\sim p(\cdot \mid s_t,z_t)$ that interacts directly with the environment.

Karl Pertsch, Youngwoon Lee, and Joseph J Lim. “Accelerating reinforcement learning with learned skill priors”. In: arXiv preprint: 2010.11944 (2020).

Taewook Nam et al. “Skill-based Meta-Reinforcement Learning”. In: Proceedings of 10th International Conference on Learning Representations (ICLR). 2022.

Furthermore, assume that the sub-trajectory $\tau$ in $L_{TE}$ and the state-action sequence $d$ in $L_{SE}$ are conditioned on the current state, so that $q_\theta$ and $q$ can be regarded as high-level policies for the MDP $M_p$. Since the output of $q_\theta$ is part of the input to $M_p$, the goal is to maximize the performance gap between $q_\theta$ and $q$, where $J_{p}$ is the average return in $M_p$:

$$\max\ \eta(\theta)=J_{p}(q_{\theta})-J_{p}(q)$$

Following the literature, $\eta(\theta)= E_{s\sim d_{q_{\theta}},\,z\sim q_{\theta}} [R^{q}_{s,z} - V_{q}(s)]$, where $d_{q_{\theta}}$ is the state visitation distribution induced by $q_\theta$, $R^{q}_{s,z}$ is the episodic return induced by $q$, and $V_q$ is the value function of $q$.

Sham M. Kakade and John Langford. “Approximately Optimal Approximate Reinforcement Learning”. In: Proceedings of the 19th International Conference on Machine Learning (ICML). 2002, pp. 267–274.

In offline RL it is hard to approximate the state visitation distribution of $q_\theta$ accurately, so the visitation distribution of $q$ is used in its place to avoid excessive error propagation. To make this valid, we optimize $\hat{\eta}(\theta)= E_{s\sim d_{q},\,z\sim q_{\theta}} [R^{q}_{s,z} - V_{q}(s)]$ under the constraint that $q$ and $q_\theta$ stay close to each other.

A constrained optimization like this one can be converted into an unconstrained one with a Lagrangian, where $\beta$ is the Lagrange multiplier.
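As a sketch of what this conversion looks like, assume the closeness constraint is a KL-divergence bound with threshold $\epsilon$, as in advantage-weighted regression. The choice of divergence and the symbol $\epsilon$ are assumptions made for illustration, and $\hat{\eta}(\theta)$ denotes the surrogate objective evaluated on the states visited by $q$:

```latex
% Constrained form (assumed KL constraint with threshold \epsilon)
\max_{\theta}\ \hat{\eta}(\theta)
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim d_{q}}\!\left[ D_{\mathrm{KL}}\!\left( q_{\theta}(\cdot \mid s) \,\|\, q(\cdot \mid s) \right) \right] \le \epsilon

% Lagrangian relaxation with multiplier \beta > 0
\mathcal{L}(\theta, \beta)
= \hat{\eta}(\theta)
+ \beta \left( \epsilon - \mathbb{E}_{s \sim d_{q}}\!\left[ D_{\mathrm{KL}}\!\left( q_{\theta}(\cdot \mid s) \,\|\, q(\cdot \mid s) \right) \right] \right)
```

Under this assumed form, setting the derivative with respect to $q_\theta$ to zero gives a solution proportional to $q(z \mid s)\exp\big((R^{q}_{s,z} - V_{q}(s))/\beta\big)$, which is the familiar exponentially weighted shape of return-weighted methods.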

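Differentiating the right-hand side of the resulting expression with respect to $q_{\theta}$ and following the optimal-policy derivation in the literature gives a closed-form solution that satisfies a return-weighted condition.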

Xue Bin Peng et al. “Advantage-weighted regression: Simple and scalable off-policy reinforcement learning”. In: arXiv preprint: 1910.00177 (2019).

Aviral Kumar, Xue Bin Peng, and Sergey Levine. “Reward-conditioned policies”. In: arXiv preprint: 1912.13465 (2019).

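When the baseline term $V_{q}(s)$ is omitted (up to a constant), we also find that the weighted skill-regularization loss $L_{SR}$ matches subtask embeddings to the high-quality skills of a given task, which promotes the decomposition of tasks into shareable and achievable subtasks.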
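The effect of return weighting can be illustrated numerically. The sketch below assumes exponential weights $\exp(R/\beta)$ normalized within a batch and a squared distance between each task embedding $z$ and its skill embedding $b$; the exact form of $L_{SR}$ is not reproduced in these notes, so both choices, along with the helper names, are illustrative.

```python
import numpy as np


def return_weights(returns, beta=1.0):
    """Batch-normalized exponential weights; higher return means higher weight."""
    r = np.asarray(returns, dtype=float)
    w = np.exp((r - r.max()) / beta)  # subtract the max for numerical stability
    return w / w.sum()


def skill_regularization(z, b, returns, beta=1.0):
    """Assumed L_SR: return-weighted squared distance between z and b."""
    w = return_weights(returns, beta)
    return float(np.sum(w * np.sum((z - b) ** 2, axis=-1)))


z = np.array([[0.0], [0.0]])   # task embeddings of two sub-trajectories
b = np.array([[1.0], [2.0]])   # their skill embeddings

uniform_loss = skill_regularization(z, b, returns=[0.0, 0.0])
weights = return_weights([10.0, 0.0], beta=1.0)
```

With a large return gap, almost all of the weight falls on the first sub-trajectory, so its task embedding is pulled toward its skill embedding much more strongly than the low-return one.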


4 Data Augmentation by Imaginary Demonstrations





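The authors then obtain the procedure for generating imaginary trajectories: the action is produced from the state and the latent variable (by the skill decoder), and the imagined next state and imagined reward are produced from the state, the latent variable, and the generated action.

Note that in this generative model, $p_\theta$ performs the same role of the world model in conventional model-based RL approaches.

This is easy to see: model-based methods learn the dynamics of the environment, and here the learned dynamics are simply used for prediction.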
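The generation loop described above can be sketched as follows. The two decoders are passed in as plain callables, and the toy lambdas at the bottom are stand-ins invented for the example; a real system would use the trained skill decoder $p_\phi$ and task decoder $p_\theta$. Using one latent `z` for the whole rollout is also a simplifying assumption.

```python
def imagine_trajectory(s0, z, skill_decoder, task_decoder, horizon):
    """Roll out an imaginary trajectory from state s0 under latent z."""
    trajectory, s = [], s0
    for _ in range(horizon):
        a = skill_decoder(s, z)            # action from state + latent
        s_next, r = task_decoder(s, a, z)  # imagined next state and reward
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory


# Toy stand-ins: the "skill" moves toward z, the "world model" applies half the action.
toy_skill_decoder = lambda s, z: z - s
toy_task_decoder = lambda s, a, z: (s + 0.5 * a, -abs(z - s))

imagined = imagine_trajectory(0.0, 1.0, toy_skill_decoder, toy_task_decoder, horizon=2)
```

The returned tuples have the same $(s_t, a_t, r_t, s_{t+1})$ layout as logged data, so they can be appended to a task's dataset before offline RL training.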

Accordingly, it turns out that the augmentation procedure in (9) yields a plausible trajectory similar to expert demonstrations, given that the high-quality skill corresponding to the trajectory is incorporated into $p_\phi$.

That is, the generated trajectories resemble expert demonstrations because the high-quality skill behind each trajectory is built into $p_\phi$.


5 Experiments

Experiment settings

Robotic manipulation environment: Meta-world. Drone environment: the Airsim drone simulator.

Comparison methods
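Name and context of each baseline:

  • TD3 + BC: a state-of-the-art offline RL algorithm that adds a behavior-cloning regularization term to the TD3 update step; a one-hot task encoding is included as part of the state.
  • PCGrad: a multi-task RL algorithm based on gradient surgery; it uses a projection function to remove directional conflicts between gradients.
  • Soft modularization (SoftMod): a modular deep neural network architecture tailored to multi-task RL; it mitigates the negative effects of learning different tasks in a single policy by using soft-weighted routing paths over a set of modules trained for multiple tasks, and it also adopts a loss-balancing strategy.

Honestly, the way the authors introduce their baselines is well worth learning from.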
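To make the PCGrad description concrete, the pairwise projection step can be written in a few lines. This sketch shows only that single step for two task gradients; the full algorithm applies it across all task pairs in random order, and `project_conflicting` is an illustrative name.

```python
import numpy as np


def project_conflicting(g_i, g_j):
    """If g_i conflicts with g_j (negative dot product), drop g_i's component along g_j."""
    dot = float(np.dot(g_i, g_j))
    if dot < 0.0:
        g_i = g_i - dot / float(np.dot(g_j, g_j)) * g_j
    return g_i


g_task1 = np.array([1.0, 0.0])
g_task2 = np.array([-1.0, 1.0])   # points partly against g_task1

projected = project_conflicting(g_task1, g_task2)
unchanged = project_conflicting(g_task1, np.array([1.0, 1.0]))
```

After projection the first gradient is orthogonal to the second, so a step along it no longer directly undoes progress on the other task.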

Offline datasets


Note that, unless stated otherwise, the MR, RP, and ME datasets of each task contain 150, 100, and 50 episode trajectories, respectively.
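A quick calculation shows how much data a given mix contains. The sketch assumes that a configuration such as (MR 10, RP 0, ME 0) counts how many of the ten tasks use each dataset type; that reading, the `mixed` configuration, and the helper name are assumptions made for the example.

```python
EPISODES_PER_TASK = {"MR": 150, "RP": 100, "ME": 50}  # default sizes from the notes


def total_episodes(config):
    """config maps a dataset type to the number of tasks that use it."""
    return sum(EPISODES_PER_TASK[kind] * n_tasks for kind, n_tasks in config.items())


low_quality = {"MR": 10, "RP": 0, "ME": 0}
mixed = {"MR": 5, "RP": 3, "ME": 2}  # hypothetical mix, not a row from the paper

low_quality_total = total_episodes(low_quality)
mixed_total = total_episodes(mixed)
```

Under this reading, configurations that lean on higher-quality data also contain fewer episodes, so quality and quantity change together across the mixes.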

5.1 Meta-world Tests

MT10 benchmark (i.e., 10 different control tasks)

The tasks share common primitive functions such as grasp and moving, so they can be seen as general multi-tasks with shared subtasks, which are consistent with our task decomposition strategy.


Performance on MT10 benchmark

TD3+BC and PCGrad show better performance for the configurations of low-quality datasets, e.g., the row of (MR 10, RP 0, ME 0), but SoftMod shows better performance for the configurations of high-quality datasets e.g., the row of (MR 0, RP 0, ME 10).

TD3+BC and PCGrad explore the orthogonality of tasks by accumulating task-specific knowledge separately without much interference when learning different tasks, and SoftMod rather exploits the commonality of the tasks by learning shared skills and dynamically extracting task-specific knowledge by the combination of its modules.


Specifically, our TD3+BC implementation with one-hot task encoding tends to learn individual tasks separately, considering that the task encoding does not represent the semantic relation of different tasks explicitly.

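That is, because the one-hot task encoding carries no explicit semantic relation between tasks, the TD3+BC implementation tends to learn each task in isolation.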

Ablation study

SRTD-Q: denotes SRTD without the quality-weighted term

SRTD+N: denotes SRTD with the Gaussian noise-based data augmentation commonly used in offline RL
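For comparison with the skill-based augmentation, the noise-based variant used in SRTD+N can be sketched as adding zero-mean Gaussian noise to logged states. The noise scale `sigma` and the choice to perturb only states are assumptions; the notes do not give the settings used in the ablation.

```python
import numpy as np


def gaussian_state_augment(states, sigma=0.01, rng=None):
    """Return a noisy copy of the logged states: s + N(0, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    return states + rng.normal(0.0, sigma, size=states.shape)


states = np.zeros((5, 3))
noisy = gaussian_state_augment(states, sigma=0.1, rng=np.random.default_rng(0))
identical = gaussian_state_augment(states, sigma=0.0, rng=np.random.default_rng(0))
```

This kind of augmentation only jitters existing transitions, whereas the imaginary demonstrations synthesize new trajectories guided by high-quality skills.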

5.2 A Case Study for Airsim-based Drone Navigation


6 Related Work

Multi-task RL


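Yang et al. propose an explicitly modularized architecture with a soft routing network for training an integrated multi-task policy. This approach, called soft modularization, addresses the unclear task relations within a single network, i.e., which shared parameters are relevant to which tasks.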

Ruihan Yang et al. “Multi-task reinforcement learning with soft modularization”. In: Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS). 2020.


Tianhe Yu et al. “Gradient surgery for multi-task learning”. In: Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS). 2020.

Task and skill embeddings in multi-task RL



Karl Pertsch et al. “Demonstration-Guided Reinforcement Learning with Learned Skills”. In: Proceedings of the 5th Conference on Robot Learning (CoRL). Vol. 164. PMLR. 2022.


Shagun Sodhani, Amy Zhang, and Joelle Pineau. “Multi-task reinforcement learning with context-based representations”. In: Proceedings of 38th International Conference on Machine Learning (ICML). PMLR. 2021, pp. 9767–9779.


Data augmentation in offline RL



Samarth Sinha, Ajay Mandlekar, and Animesh Garg. “S4RL: Surprisingly simple self-supervision for offline reinforcement learning in robotics”. In: Proceedings of 5th Conference on Robot Learning (CoRL). PMLR. 2022, pp. 907–917.


Tianhe Yu et al. “Conservative data sharing for multi-task offline reinforcement learning”. In: Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS). 2021


7 Conclusion

The direction of our future works is to investigate the hierarchy of skill representation with different temporal abstraction levels in multi-task offline RL. This will tackle the limitation of our model that considers only fixed-length sub-trajectories for task and skill embeddings.




