In this work, we study universal zero-shot segmentation, which aims to achieve panoptic, instance, and semantic segmentation for novel categories without any training samples. Such zero-shot segmentation ability relies on inter-class relationships in semantic space to transfer the visual knowledge learned from seen categories to unseen ones. It is therefore desirable to bridge the semantic and visual spaces well and to apply the semantic relationships to visual feature learning. We introduce a generative model that synthesizes features for unseen categories, which links the semantic and visual spaces and addresses the lack of unseen training data. Furthermore, to mitigate the domain gap between the semantic and visual spaces, we first enhance the vanilla generator with learned primitives, each of which contains fine-grained attributes related to categories, and synthesize unseen features by selectively assembling these primitives. Second, we propose to disentangle the visual feature into a semantic-related part and a semantic-unrelated part that contains useful visual classification clues but is less relevant to the semantic representation. The inter-class relationships of the semantic-related visual features are then required to be aligned with those in semantic space, thereby transferring semantic knowledge to visual feature learning. The proposed approach achieves state-of-the-art performance on zero-shot panoptic segmentation, instance segmentation, and semantic segmentation.


1. Introduction

Image segmentation aims to group pixels with different semantics, e.g., category or instance [11, 41]. Deep learning methods [9, 11, 17, 29, 36, 37, 48] have greatly advanced the performance of image segmentation with the powerful learning ability of CNNs [30] and Transformers [59]. However, since deep learning methods are data-driven, their intense demand for large-scale labeled training samples, which are labor-intensive and time-consuming to collect, poses great challenges. To address this issue, zero-shot learning (ZSL) [38, 51] has been proposed to classify novel objects with no training samples. Recently, ZSL has been extended to segmentation tasks such as zero-shot semantic segmentation (ZSS) [4, 62] and zero-shot instance segmentation (ZSI) [68]. Herein, we further introduce zero-shot panoptic segmentation (ZSP) and aim to build a universal framework for zero-shot panoptic/semantic/instance segmentation with the help of semantic knowledge, as shown in Fig. 1.


Figure 1. Zero-shot image segmentation aims to transfer the knowledge learned from seen classes to unseen ones (i.e., classes that never appear in training) with the help of semantic knowledge.


Unlike image classification, segmentation requires pixel-wise classification and is more challenging in terms of class representation learning. Substantial efforts have been devoted to zero-shot semantic segmentation [4, 62], and existing methods can be categorized into projection-based methods [20, 62, 66] and generative model-based methods [4, 27, 40]. The generative model-based methods are usually superior to the projection-based methods because they produce synthetic training features for the unseen group, which helps alleviate the crucial bias issue [53] of tending to classify objects into seen classes. Owing to these merits, we follow the paradigm of generative model-based methods to address zero-shot segmentation tasks.


However, current generative model-based methods usually take the form of per-pixel generation, which is not robust enough in more complicated scenarios. Recently, several works have proposed to decouple segmentation into class-agnostic mask prediction and object-level classification [8, 11, 31, 61]. We follow this strategy and reduce pixel-level generation to a more robust object-level generation. Moreover, previous generative works [4, 27, 40] usually learn a direct mapping from semantic embeddings to visual features. Such a generator does not consider the visual-semantic gap in feature granularity, namely that images contain much richer information than language. The direct mapping from coarse to fine-grained information results in low-quality synthetic features. To address this issue, we propose to utilize abundant primitives with very fine-grained semantic attributes to compose visual representations. Different assemblies of these primitives construct different class representations, where the assembly is decided by the relevance between the primitives and the semantic embeddings. Primitives greatly enhance the expressive diversity and effectiveness of the generator, especially in terms of rich fine-grained attributes, making the synthetic features for different classes more reliable and discriminative.


However, only real image features of seen classes are available to supervise the generator, leaving unseen classes unsupervised. To provide more constraints for the feature generation of unseen classes, we propose to transfer the inter-class relationships in semantic space to visual space. The category relationships obtained from semantic embeddings are employed to constrain the inter-class relationships of visual features. With such a constraint, the visual features, especially the synthesized features for unseen classes, are encouraged to have the same inter-class structure as in semantic space. Nevertheless, there is a discrepancy between the visual space and the semantic space [10, 57], and hence between their inter-class relationships. Visual features contain richer information and cannot be fully aligned with semantic embeddings. Directly aligning two disjoint relationships inevitably compromises the discriminability of visual features. To address this issue, we propose to disentangle visual features into semantic-related and semantic-unrelated features, where the former are better aligned with the semantic embeddings while the latter act as noise with respect to semantic space. We only use semantic-related features for relationship alignment. The proposed relationship alignment and feature disentanglement are mutually beneficial. Feature disentanglement builds a semantic-related visual space to facilitate relationship alignment and excludes semantic-unrelated features that are noisy for alignment. Relationship alignment in turn contributes to disentangling semantic-related features by providing semantic clues.


Overall, the main contributions are as follows:

• We study universal zero-shot segmentation and propose Primitive generation with collaborative relationship Alignment and feature Disentanglement learning (PADing) as a unified framework for ZSP/ZSI/ZSS.

• We propose a primitive generator that employs a large number of learned primitives with fine-grained attributes to synthesize visual features for unseen categories, which helps to address the bias issue and the domain gap issue.

• We propose a collaborative relationship alignment and feature disentanglement learning approach that helps the generator produce better synthetic features.

• The proposed approach PADing achieves new state-of-the-art performance on zero-shot panoptic segmentation (ZSP), zero-shot instance segmentation (ZSI), and zero-shot semantic segmentation (ZSS).

2. Related Work

Zero-shot learning (ZSL) [35, 38, 51, 67] aims to classify images of unseen classes with no training samples by utilizing semantic descriptors as auxiliary information. There are two main paradigms: classifier-based methods that learn a visual-semantic projection [1, 42, 67] and instance-based methods [21, 64] that synthesize fake samples for unseen classes. Generalized zero-shot learning (GZSL), introduced by Scheirer et al. [55], aims to classify samples from both seen and unseen sets. Chao et al. [6] then show experimentally that ZSL methods do not work well in the GZSL setting because they overfit to seen classes. Classification score calibration methods [5, 13, 28, 33] and out-of-distribution detector methods [3, 24] have been proposed to alleviate this bias issue.


[55] Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. Toward open set recognition. IEEE TPAMI, 35(7), 2012.

Image Segmentation is one of the most fundamental computer vision tasks [14, 18, 19, 22, 45, 46]. Deep-learning-based image segmentation methods in the fully supervised setting have been extensively studied [9, 11, 15, 16, 29, 36, 37, 47, 48, 56, 61]. However, these methods require a large number of labeled training samples and cannot handle unseen categories that do not appear or are not defined in the training data. To address these issues, Zero-Shot Semantic Segmentation (ZSS) [4] and Zero-Shot Instance Segmentation (ZSI) [68] extend ZSL methods to semantic segmentation and instance segmentation, respectively. In this work, we further introduce Zero-Shot Panoptic Segmentation (ZSP) to extend zero-shot learning to the panoptic segmentation task. Zero-shot segmentation methods follow two main paradigms: projection-based methods [8, 20, 32, 49, 52, 62, 63, 66] and generative methods [4, 12, 40]. Projection-based methods commonly map the visual or semantic features of seen categories onto a shared space (e.g., visual, semantic, or latent space), and then classify novel objects by measuring the feature similarity in this common space. Generative methods adopt a generator to produce synthetic features for unseen classes. However, existing generative works [4, 27, 40] usually learn a direct mapping from semantic embeddings to visual features and do not consider the visual-semantic gap in feature granularity. We design a primitive generation and semantic-related alignment approach to universally address zero-shot segmentation, including ZSP, ZSI, and ZSS.


3. Methodology

Fig. 2 illustrates the overall architecture of our proposed approach, Primitive generation with collaborative relationship Alignment and feature Disentanglement learning (PADing). Our backbone predicts a set of class-agnostic masks and their corresponding class embeddings. A primitive generator is trained to synthesize class embeddings from semantic embeddings. The real and synthetic class embeddings are disentangled into semantic-related and semantic-unrelated features. We conduct relationship alignment learning on the semantic-related features. With the synthesized unseen class embeddings, we re-train our classifier with both the real class embeddings of seen categories and the synthetic class embeddings of unseen categories. The training process is given in Algorithm 1. The details of each part are introduced in the following sections.


Figure 2. Overview of our approach PADing for universal zero-shot image segmentation. We first obtain class-agnostic masks and their corresponding global representations, named class embeddings, from our backbone. A primitive generator is trained to produce synthetic features (i.e., fake class embeddings). The classifier, which takes class embeddings as input, is trained with both the real class embeddings from images and the synthetic class embeddings from the generator. During the training of the generator, the proposed feature disentanglement and relationship alignment are employed to constrain the synthesized features.


3.1. Task Formulation 

Herein we give the problem formulation of zero-shot image segmentation. There are two spaces, the feature space $X$ and the semantic space $A$, which represent the visual features of images and the semantic representations of categories, denoted as $X = \{X^s, X^u\}$ and $A = \{A^s, A^u\}$, respectively. The superscripts $s$ and $u$ represent the two non-overlapping groups, $N^s$ seen categories and $N^u$ unseen categories, respectively. We use $Y = \{Y^s, Y^u\}$ to denote the ground-truth labels, where $Y^s$ is the label set of the seen group, $Y^u$ is the label set of the unseen group, and $Y^s \cap Y^u = \emptyset$. The training set is constructed from images that contain any of the $N^s$ seen categories but no unseen categories, which is different from the open-vocabulary paradigm [34, 65]. According to the categories that appear in the testing set, there are two different settings, named zero-shot learning (ZSL) and generalized zero-shot learning (GZSL). ZSL only classifies testing samples of unseen categories, while GZSL needs to classify testing data of both seen and unseen categories. Zero-shot segmentation is naturally a kind of GZSL because images typically contain multiple and diverse categories. In this work, all the zero-shot segmentation tasks are under the GZSL setting unless otherwise specified.


3.2. Primitive Cross-Modal Generation

Due to the lack of unseen samples, the classifier cannot be optimized with features of unseen classes. As a result, the classifier trained on seen classes tends to assign all objects/stuff a label from the seen group, which is called the bias issue [6]. To address this issue, previous methods [4, 27, 40] propose to utilize a generative model to synthesize fake visual features for unseen classes. However, previous generative zero-shot segmentation works [4, 27, 40] commonly adopt a Generative Moment Matching Network (GMMN) [40, 43] or a GAN [25], consisting of multiple linear layers, as the feature generator. Such a generator, although it achieves good performance, does not consider the visual-semantic difference in feature granularity. It is well known that images generally contain much richer information than language. Visual information provides very fine-grained attributes of objects, while textual information typically provides abstract and high-level attributes. This difference results in an inconsistency between visual features and semantic features. To address this challenge, we propose a Primitive Cross-Modal Generator that employs a large number of learned attribute primitives to construct visual representations.


[4] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick Pérez. Zero-shot semantic segmentation. NeurIPS, 32, 2019.

[27] Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, and Liqing Zhang. Context-aware feature generation for zero-shot semantic segmentation. In ACM MM, 2020.

[40] Peike Li, Yunchao Wei, and Yi Yang. Consistent structural relation learning for zero-shot segmentation. NeurIPS, 33, 2020.

As shown in Fig. 3, we build our Primitive Generator with a Transformer architecture. First, a set of learnable primitives is randomly initialized, denoted as $P = \{p_i\}_{i=1}^{N}$, where $p_i \in \mathbb{R}^{d_k}$ and $d_k$ is the number of channels. These primitives are assumed to contain very fine-grained attributes related to categories, e.g., hair, color, shape, etc. Different assemblies of these primitives build different representations for categories. Self-attention is first performed on these primitives to construct a relationship graph among them. Next, we utilize two different linear layers $\omega_K$ and $\omega_V$ to process $P$ and obtain the Key and Value for cross attention, denoted as $K$ and $V$, respectively. Then, taking the semantic embeddings as the Query $Q$, cross attention is performed as


where $X'$ represents the synthetic visual features and $Z$ denotes a random sample drawn from a fixed Gaussian distribution. $\omega_1$ is a linear layer. Instead of generating features by processing the semantic embedding with several linear layers, we synthesize visual features by a weighted assembly of these abundant primitives, which provides much more diverse and richer representations. Moreover, for related categories that share some similarities in semantic space, primitives provide an explicit way to express such similarities. For example, dog and cat both have fur and a tail, so the primitives related to fur and tail show a high response to the semantic embedding queries of dog and cat. With such primitives that describe fine-grained attributes, we can easily construct different category representations and transfer the knowledge of seen classes to unseen ones.
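The weighted assembly of primitives described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the self-attention step over the primitives is omitted, the matrices `W_K`, `W_V`, and `W_1` are random stand-ins for the learned layers $\omega_K$, $\omega_V$, and $\omega_1$, the query `Q` is assumed to be already projected to $d_k$ channels, and the place where the noise $Z$ enters (added to the assembled feature before $\omega_1$) is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def primitive_cross_attention(Q, P, W_K, W_V, W_1, rng):
    """Assemble primitives P into synthetic features for semantic queries Q.

    Q: (C, d_k) semantic embeddings, one row per class (assumed projected to d_k)
    P: (N, d_k) learnable primitives
    W_K, W_V: (d_k, d_k) stand-ins for the linear layers omega_K, omega_V
    W_1: (d_k, d_out) stand-in for the linear layer omega_1
    """
    K = P @ W_K
    V = P @ W_V
    d_k = K.shape[-1]
    # Relevance of every primitive to every class query; each row sums to 1.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (C, N)
    assembled = attn @ V                               # (C, d_k)
    # ASSUMPTION: Gaussian noise Z is added before the final linear layer.
    Z = rng.standard_normal(assembled.shape)
    X_syn = (assembled + Z) @ W_1                      # (C, d_out)
    return X_syn, attn

rng = np.random.default_rng(0)
num_primitives, d_k, d_out, num_classes = 400, 16, 32, 3
P = rng.standard_normal((num_primitives, d_k))
Q = rng.standard_normal((num_classes, d_k))
W_K = rng.standard_normal((d_k, d_k)) / np.sqrt(d_k)
W_V = rng.standard_normal((d_k, d_k)) / np.sqrt(d_k)
W_1 = rng.standard_normal((d_k, d_out)) / np.sqrt(d_k)
X_syn, attn = primitive_cross_attention(Q, P, W_K, W_V, W_1, rng)
```

Each row of `attn` is the assembly weight vector of one class over all primitives, so two semantically close classes that respond to the same primitives end up with similar synthetic features.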


We follow [43] to define our generator loss $\mathcal{L}_G$, which reduces the maximum mean discrepancy (MMD) between two probability distributions:


[43] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML, 2015.

where $X^s$ and $X^{s\prime}$ denote the real visual features and the synthetic visual features of seen classes, respectively. $k$ is a Gaussian kernel, $k(f, f') = \exp\left(-\frac{1}{2\sigma^2}\|f - f'\|^2\right)$, with bandwidth $\sigma$.
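As an illustration of a maximum mean discrepancy loss with the Gaussian kernel defined above, the following NumPy sketch computes a standard (biased) estimate of the squared MMD between real and synthetic features. Two choices here are assumptions rather than details taken from the text: kernel values are averaged over pairs instead of summed, and the estimates for the bandwidths $\sigma \in \{2, 5, 10, 20, 40, 60\}$ reported in the implementation details are simply added together.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # k(f, f') = exp(-||f - f'||^2 / (2 sigma^2)) for every pair of rows.
    sq_dist = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_loss(X_real, X_syn, sigmas=(2, 5, 10, 20, 40, 60)):
    """Biased squared-MMD estimate, accumulated over kernel bandwidths.

    ASSUMPTION: pairwise kernel values are averaged and bandwidths are summed.
    """
    loss = 0.0
    for sigma in sigmas:
        loss += gaussian_kernel(X_real, X_real, sigma).mean()
        loss += gaussian_kernel(X_syn, X_syn, sigma).mean()
        loss -= 2.0 * gaussian_kernel(X_real, X_syn, sigma).mean()
    return loss

rng = np.random.default_rng(0)
X_real = rng.standard_normal((64, 8))
loss_same = mmd_loss(X_real, X_real)          # identical sets
loss_near = mmd_loss(X_real, X_real + 0.1)    # slightly shifted copy
loss_far = mmd_loss(X_real, X_real + 5.0)     # strongly shifted copy
```

The loss vanishes when the two feature sets coincide and grows as the synthetic distribution drifts away from the real one, which is the behavior the generator is trained to exploit.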


When a semantic embedding from the unseen group is fed into the trained Primitive Generator, we obtain its corresponding synthetic class embedding. We then re-train our classifier with both the real class embeddings of seen categories and the synthetic class embeddings of unseen categories, which greatly alleviates the bias issue. Besides, such global representations are more robust than per-pixel classification [4, 27, 40, 66, 68] and can thus be better aligned between visual space and semantic space.


3.3. Semantic-Visual Relationship Alignment

It is well known that the relationships among categories are naturally different [7, 40, 60]. For example, consider three objects: apple, orange, and cow. Obviously, apple and orange are more closely related than apple and cow. Class relationships in semantic space are powerful prior knowledge, yet category-specific feature generation does not explicitly leverage such relationships. As shown in Fig. 4, we build such relationships with semantic embeddings and transfer this knowledge to visual space, achieving semantic-visual alignment in terms of class-wise relationships. By considering these relationships, more constraints are placed on the feature generation of unseen categories, pulling them toward or pushing them away from seen categories.


Figure 4. Relationship alignment. (a) The conventional relationship alignment. (b) Our proposed two-step relationship alignment. Considering the domain gap, we introduce a semantic-related visual space, where features are disentangled from visual space and have more direct relevance to semantic space. We align the relationships in the semantic-related visual space with those in semantic space. $u_i$/$s_j$ refers to an unseen/seen category. Taking $u_1$ dog as an example, we aim to transfer its similarities with {cat, elephant, horse, zebra} from semantic to visual space.


Semantic-related Visual Feature. However, the visual features are not fully aligned with the semantic representations but contain richer information, including both semantic-related visual features and semantic-unrelated visual features. Semantic-unrelated features may carry strong visual clues and contribute to classification, but have low relevance to language semantic representations. Directly aligning semantic embeddings with the original visual features would confuse the generator and reduce its generalization to unseen categories. To address this issue, we propose to disentangle the semantic-related visual features and the semantic-unrelated visual features. Given a feature $x_i$, where $x_i \in X$ is the class embedding from either the backbone or our generator, feature disentanglement learns how to disentangle and reconstruct $x_i$ itself. We use an encoder $E_R$ to extract the semantic-related feature, $\hat{x}_i = E_R(x_i)$. Then, we calculate the correlation score between the semantic-related feature $\hat{x}_i$ and the semantic embeddings $A = \{a_1, ..., a_{N^s+N^u}\}$. $E_R$ is trained with a cross-entropy loss as a classification problem to endow the semantic-related features $\hat{x}_i$ with discriminative semantic knowledge, i.e.,


where $[\hat{x}_i]$ is the ground-truth class index of $\hat{x}_i$, and $\mathbb{1}(\cdot)$ is the indicator function that outputs 1 if the condition is true and 0 otherwise. $\tau$ is the temperature parameter.
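A temperature-scaled cross-entropy over correlation scores between semantic-related features and semantic embeddings can be sketched as follows. This is an illustration under assumptions, not the exact loss of the paper: the correlation score is taken to be cosine similarity, the semantic-related features are assumed to share the dimensionality of the semantic embeddings, and the loss is averaged over the batch.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def semantic_related_ce(x_hat, A, labels, tau=0.1):
    """Cross-entropy between semantic-related features and class embeddings.

    x_hat:  (B, d) semantic-related features
    A:      (C, d) semantic embeddings of all seen and unseen classes
    labels: (B,)   ground-truth class index of each feature
    ASSUMPTION: the correlation score is cosine similarity divided by tau.
    """
    logits = l2_normalize(x_hat) @ l2_normalize(A).T / tau   # (B, C)
    log_prob = log_softmax(logits, axis=-1)
    # Pick the log-probability of the ground-truth class for each feature.
    return -log_prob[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 16))
labels = np.array([0, 1, 2, 3, 4])
x_hat = A[labels]                                   # perfectly aligned features
loss_correct = semantic_related_ce(x_hat, A, labels)
loss_wrong = semantic_related_ce(x_hat, A, (labels + 1) % 5)
```

A small temperature such as 0.1 sharpens the softmax, so a feature is penalized heavily unless its highest correlation is with the embedding of its own class.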


We use another encoder $E_U$ to extract the semantic-unrelated feature, denoted as $\ddot{x}_i = E_U(x_i)$. We assume that the semantic-unrelated features follow the normal distribution $\mathcal{N}(0, 1)$ with zero mean and unit variance [39]. We use a KL divergence loss to constrain the distribution range,


where $D_{KL}[p\|q] = \int p(z)\log\frac{p(z)}{q(z)}\,dz$, such that each class has its own independent and diverse feature component. To push the network to extract more representative semantic-related features and preserve visual feature information, we reconstruct the feature with a decoder $D$ under an $\ell_1$ loss:


The training objective for feature disentanglement is $\mathcal{L}_D = \mathcal{L}_R + \mathcal{L}_U + \mathcal{L}_{recon}$.
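The two remaining terms of the disentanglement objective can be sketched in NumPy as below. How the distribution of the semantic-unrelated features is parameterized is not stated here, so this sketch makes an assumption: it fits a diagonal Gaussian to a batch of features and uses the closed-form KL divergence to $\mathcal{N}(0, 1)$, namely $\frac{1}{2}(\mu^2 + \sigma^2 - \log\sigma^2 - 1)$ per dimension. The reconstruction term is a plain mean absolute error, and `x_rec` stands for the decoder output, whose construction is left out.

```python
import numpy as np

def kl_to_standard_normal(x_unrel, eps=1e-8):
    """KL( N(mu, var) || N(0, 1) ), summed over feature dimensions.

    ASSUMPTION: mu and var are estimated per dimension from the batch.
    """
    mu = x_unrel.mean(axis=0)
    var = x_unrel.var(axis=0) + eps
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0)

def l1_reconstruction(x, x_rec):
    # Mean absolute error between the input feature and its reconstruction.
    return np.abs(x - x_rec).mean()

def disentanglement_loss(loss_r, x_unrel, x, x_rec):
    # L_D = L_R + L_U + L_recon, with L_R computed elsewhere.
    return loss_r + kl_to_standard_normal(x_unrel) + l1_reconstruction(x, x_rec)

rng = np.random.default_rng(0)
standard = rng.standard_normal((20000, 4))
kl_standard = kl_to_standard_normal(standard)        # close to 0
kl_shifted = kl_to_standard_normal(standard + 3.0)   # clearly positive
x = rng.standard_normal((8, 4))
total = disentanglement_loss(0.5, standard, x, x)    # perfect reconstruction
```

Features that already follow a standard normal distribution incur almost no penalty, while a shift of the mean or a change of scale is penalized, which keeps the semantic-unrelated component within a fixed range.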


Relationship Alignment. We then conduct relationship alignment between the semantic-related visual space and the semantic space. We use a KL divergence loss to make the similarity of any two semantic-related features $\hat{x}_i$ and $\hat{x}_j$ approach the similarity of their corresponding semantic embeddings $a_{[\hat{x}_i]}$ and $a_{[\hat{x}_j]}$, i.e.,


where $[\hat{x}_i]$ is the ground-truth class index of $\hat{x}_i$, and $\tau$ is the temperature parameter that controls the sharpness of the similarity distributions on which the KL loss operates. $\hat{x}^s_i$ of the seen group comes from either real features or synthetic features, while $\hat{x}^u_i$ of the unseen group comes only from synthetic features produced by the generator. There are two kinds of alignment in Eq. (6), intra-group alignment and inter-group alignment, with different focuses. When $\hat{x}_i$ and $\hat{x}_j$ are from the same group, e.g., $\hat{x}^s_i$ and $\hat{x}^s_j$ both from the seen group, it is intra-group alignment, which contributes to extracting better class representations with the relationships as a constraint. When they are from different groups, e.g., $\hat{x}^s_i$ from the seen group and $\hat{x}^u_j$ from the unseen group, it is inter-group alignment, which aims to transfer the relationship knowledge from seen to unseen. Inter-group alignment constrains the relationships between seen and unseen categories and between real and synthetic features. It greatly improves the model's adaptability and generalization to unseen categories.
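The idea of matching pairwise similarities in the semantic-related visual space to those in semantic space can be illustrated with the following NumPy sketch. It is a simplified stand-in for Eq. (6), with several assumptions: similarities are cosine similarities, each row of the similarity matrix is turned into a distribution with a temperature softmax (self-similarity included), the KL divergence is taken from the semantic distribution to the visual one, and the features share the dimensionality of the semantic embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relation_distribution(F, tau):
    # Row i holds the similarity distribution of sample i over all samples.
    Fn = l2_normalize(F)
    return softmax(Fn @ Fn.T / tau, axis=-1)

def alignment_loss(x_hat, A, labels, tau=0.1, eps=1e-12):
    """Mean KL between semantic and visual relationship distributions.

    ASSUMPTION: KL(p_sem || p_vis), averaged over the samples in the batch.
    """
    p_sem = relation_distribution(A[labels], tau)   # from semantic embeddings
    p_vis = relation_distribution(x_hat, tau)       # from semantic-related features
    kl = (p_sem * (np.log(p_sem + eps) - np.log(p_vis + eps))).sum(axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 16))               # 6 classes, seen and unseen mixed
labels = np.array([0, 1, 2, 3, 4, 5, 0, 3])
aligned = alignment_loss(A[labels], A, labels)                     # same structure
misaligned = alignment_loss(rng.standard_normal((8, 16)), A, labels)
```

Because the batch can mix real seen features with synthetic seen and unseen features, the same function covers both intra-group and inter-group pairs.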


Collaborative Disentanglement and Alignment. Our disentanglement and alignment are complementary and mutually beneficial. On the one hand, disentanglement promotes relationship alignment. With disentanglement, semantic-related features can be extracted for alignment and semantic-unrelated noise is excluded. On the other hand, relationship alignment facilitates disentanglement. By introducing intra-group and inter-group alignment, class-wise relationships among semantic-related features can be constructed and the discrepancy between the semantic and visual feature distributions can be reduced, eventually improving feature disentanglement.


3.4. Training Objective 

Algorithm 1 shows the overall training pipeline of our universal zero-shot segmentation model. First, we pre-train our segmentation backbone with annotated data from seen classes in a fully supervised manner. Next, we train the primitive generator under the following objective:


where $\lambda$ is the weight that controls the importance of the disentanglement and alignment module. Once the generator is trained, it can generate synthetic features for unseen classes. Together with the real features from seen classes, these are used to train a new classification layer.


4. Experiments

4.1. Experimental Setup 

Implementation Details. The proposed network and all our experiments are implemented in PyTorch. We utilize CLIP text embeddings [54] and word2vec [50] as our semantic embeddings and normalize them with $\ell_2$ normalization. CLIP text embeddings are extracted following previous works [20, 26]. We adopt Mask2Former [11] built upon ResNet-50 [30] as the backbone, with 100 queries for both training and inference. Hyper-parameters are consistent with the setting of [11] unless otherwise specified. The encoders $E_R$ and $E_U$ are both multi-layer perceptrons (MLPs) containing one hidden layer, LeakyReLU activation, and dropout. The decoder $D$ is constructed with two stacked single MLP layers followed by LeakyReLU activation and dropout. We apply the SGD optimizer to the parameters of the classifier with learning rate $1 \times 10^{-3}$, weight decay $5 \times 10^{-4}$, and momentum 0.9, and the Adam optimizer to the parameters of the generator, $E_R$, $E_U$, and $D$ with initial learning rate $2 \times 10^{-4}$. The number of Transformer layers, the loss weight $\lambda$ in Eq. (7), the temperature $\tau$, and the kernel bandwidth $\sigma$ are set to 3, 0.002, 0.1, and {2, 5, 10, 20, 40, 60}, respectively.


Datasets. We use the popular MSCOCO 2017 dataset, which consists of a training set with 118k images and a validation set with 5k images. For panoptic segmentation, 133 classes (80 thing classes and 53 stuff classes) are included in the annotations. For semantic segmentation, COCO-Stuff contains 171 valid classes in total. For a fair comparison with ZSI [68], we use MSCOCO 2014 for instance segmentation, which contains 80k training and 40k validation images.


[68] Ye Zheng, Jiahong Wu, Yongqiang Qin, Faen Zhang, and Li Cui. Zero-shot instance segmentation. In CVPR, 2021.

4.2. Zero-Shot Panoptic Segmentation Task 

Because of the high similarity between semantic segmentation and panoptic segmentation, we build the ZSP dataset by following the previous ZSS work SPNet [62]. In order to avoid any information leakage, SPNet selects 15 classes in COCO-Stuff that do not appear in ImageNet as unseen classes. In the COCO panoptic dataset, we find 14 classes that overlap with the 15 selected by SPNet and set them as unseen classes, i.e., {cow, giraffe, suitcase, frisbee, skateboard, carrot, scissors, cardboard, sky-other-merged, grass-merged, playingfield, river, road, tree-merged}, while the remaining 119 classes are set as seen classes. To guarantee no information leakage in the training set, we discard the training images that contain even one pixel of any unseen class. The model is thus trained only on samples of seen classes, using 45,617 training images. We use all 5k validation images to evaluate the performance of ZSP. Panoptic and semantic segmentation tasks are evaluated on the union of thing and stuff classes, while instance segmentation is only evaluated on the thing classes.
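The filtering rule for the training set, which drops every image that contains any unseen class, can be sketched in plain Python. The 14 unseen class names are taken from the list above; the `annotations` dictionary, its image ids, and the seen class names in it are hypothetical toy data, since the actual annotation format is not described here.

```python
# The 14 unseen classes listed for the COCO panoptic split.
UNSEEN_CLASSES = {
    "cow", "giraffe", "suitcase", "frisbee", "skateboard", "carrot",
    "scissors", "cardboard", "sky-other-merged", "grass-merged",
    "playingfield", "river", "road", "tree-merged",
}

def keep_for_training(image_classes, unseen=UNSEEN_CLASSES):
    """Return True only if the image contains no unseen class at all."""
    return unseen.isdisjoint(image_classes)

# Hypothetical toy annotations: image id -> set of class names present.
annotations = {
    "img_001": {"person", "road"},          # contains unseen "road"
    "img_002": {"person", "dog", "wall"},   # seen classes only
    "img_003": {"cow", "grass-merged"},     # unseen classes only
}

train_ids = sorted(
    image_id for image_id, classes in annotations.items()
    if keep_for_training(classes)
)
```

An image is discarded even when it also contains seen classes, as for `img_001`, which is why the filtered training set is much smaller than the full 118k-image set.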


[62] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero- and few-label semantic segmentation. In CVPR, 2019.

Evaluation Metrics. Under the GZSL setting, the model needs to segment objects/stuff of both seen and unseen classes, which is closer to complicated real-world scenarios. Following previous ZSS [4, 62], ZSD [2], and ZSI [68] tasks, we compute the seen metrics, the unseen metrics, and the harmonic mean (HM) of the seen and unseen metrics as follows:

$$\mathrm{HM} = \frac{2 \cdot P_{seen} \cdot P_{unseen}}{P_{seen} + P_{unseen}},$$


where $P_{seen}$ and $P_{unseen}$ denote the seen and unseen metrics, respectively. We use the PQ (panoptic quality) metric [37], which can be viewed as the product of a segmentation quality (SQ) and a recognition quality (RQ). We also report results on the instance segmentation, object detection, and semantic segmentation tasks. For instance segmentation and object detection, we use the standard mAP (mean Average Precision) [44] with an IoU threshold of 0.5. For semantic segmentation, we use mIoU (mean Intersection-over-Union) [23].
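The harmonic mean of a seen and an unseen metric can be computed with a few lines of Python. The function name and the input values below are illustrative only and are not results from the paper; returning 0 when both metrics are 0 is a convention chosen here to avoid a division by zero.

```python
def harmonic_mean(p_seen, p_unseen):
    """Harmonic mean of a seen metric and an unseen metric (e.g., PQ or mIoU)."""
    if p_seen + p_unseen == 0:
        return 0.0  # convention for this sketch: avoid division by zero
    return 2.0 * p_seen * p_unseen / (p_seen + p_unseen)

# Illustrative values only.
hm_balanced = harmonic_mean(40.0, 10.0)   # 2 * 40 * 10 / 50 = 16.0
hm_biased = harmonic_mean(40.0, 0.0)      # collapses to 0.0
```

Unlike the arithmetic mean, the harmonic mean collapses when the unseen metric is near zero, so a model that is strongly biased toward seen classes cannot obtain a high HM.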


4.3. Ablation Study 

In Tab. 1 and Tab. 2, we perform ablation studies of the proposed PADing on the MSCOCO dataset under four tasks: zero-shot panoptic segmentation, zero-shot instance segmentation, zero-shot object detection, and zero-shot semantic segmentation. It is worth noting that the results in Tab. 2 are obtained by the model trained on the zero-shot panoptic segmentation task only, which achieves our goal of training a single model for universal zero-shot image segmentation tasks. For simplicity, our ablation analysis mainly focuses on ZSP, because ZSI, ZSD, and ZSS show similar trends to ZSP. First, to demonstrate the advantage of introducing a generative model, we implement a projection-based segmentation baseline that uses CLIP text embeddings as the classifier's weights, similar to ZegFormer-seg [20]. During training, 119 text embeddings are used in the classifier, while during inference, we add another 14 unseen text embeddings to the classifier and assign each object to one of these 133 classes. As shown in the 2nd row of Tab. 1, there is a strong bias towards seen classes, resulting in extremely low or even zero accuracy for the unseen group. Next, we construct a baseline built upon the generative GMMN model following ZS3 [4], which outperforms the projection-based method by 4.9% in terms of unseen PQ. This shows that the generative model helps to solve the crucial bias issue.


 Table 1. Zero-shot panoptic segmentation ablation study results on MSCOCO. G, P, A, D denote GMMN generator, primitive generator, disentanglement, and alignment, respectively. 


Table 2. Ablation study on ZSD, ZSI, and ZSS. G, P, A, D denote GMMN generator, primitive generator, disentanglement, and alignment, respectively. The results validate our goal of training a single model for universal zero-shot image segmentation tasks. 


Number of Primitives. We report the network's performance with different numbers of primitives in Tab. 3. Increasing the number of primitives from 100 to 400 brings a significant performance gain of 4.2%. The performance drops slightly when the number of primitives is larger than 400, so we choose 400 as the default setting.


Effectiveness of Alignment. Applying semantic alignment as a constraint on our generator further improves HM-PQ by 2.6%, demonstrating the effectiveness of introducing the inter-class relationships inherent in semantic space. Finally, we evaluate the alignment module together with disentanglement, see 6) PADing in Tab. 1 and Tab. 2. Compared with using alignment only, alignment+disentanglement transfers semantic prior knowledge to the semantic-related features and consistently brings performance gains of 2.0% HM-PQ, 13.3% HM-SQ, and 2.7% HM-RQ. This significant improvement demonstrates that the semantic-visual discrepancy is alleviated by omitting semantic-unrelated noise. Disentanglement thus enables more effective alignment in the separated semantic-related space.
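The idea of aligning inter-class relationships across the two spaces can be sketched as follows. This is an illustrative assumption rather than the loss used in this work: it compares pairwise cosine-similarity matrices of class prototypes with a mean squared difference, and the names `relation_matrix` and `alignment_loss` as well as the toy prototypes are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relation_matrix(prototypes):
    """Pairwise cosine similarities between class prototypes."""
    return [[cosine(a, b) for b in prototypes] for a in prototypes]

def alignment_loss(semantic_protos, visual_protos):
    """Mean squared difference between the two inter-class relation matrices."""
    rs = relation_matrix(semantic_protos)
    rv = relation_matrix(visual_protos)
    n = len(rs)
    return sum((rs[i][j] - rv[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Toy class prototypes (hypothetical values), one row per class.
semantic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned_visual = [[2.0, 0.0], [0.0, 3.0], [4.0, 4.0]]     # same angles, different scales
misaligned_visual = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]  # class relations scrambled

loss_aligned = alignment_loss(semantic, aligned_visual)
loss_misaligned = alignment_loss(semantic, misaligned_visual)
```

Because only relationships between classes are compared, the two spaces need not share a dimension or scale. The sketch operates on whole feature vectors; restricting it to the semantic-related part would correspond to the alignment+disentanglement setting discussed above.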


Visualization of synthesized feature representations. To study the properties of our synthesized unseen features and demonstrate the effectiveness of the proposed approach, we employ t-SNE [58] to show the distribution of our synthetic features in Fig. 5. As shown in Fig. 5 (a), the synthesized features produced by the GMMN generator are cluttered due to the semantic-visual discrepancy. In Fig. 5 (b), after introducing our primitive generator, features belonging to the same class become more compact and features from different classes are highly separable. Furthermore, after applying the relationship-alignment constraint to the semantic-related feature, see Fig. 5 (c), features belonging to different classes are farther apart with better-structured distributions. This shows that the structural relationship is embedded into the synthetic features and that the synthesized unseen features become considerably more discriminative.


4.4. Comparison with State-of-the-art ZSS Methods

To further validate the superiority of our approach, we compare it with previous state-of-the-art ZSS methods on the challenging semantic segmentation dataset COCO-Stuff in Tab. 4. For a fair comparison, we only report results obtained without self-training and without the complicated crop-mask image preprocessing used for the CLIP image encoder. We train our model with semantic segmentation annotations. The proposed approach outperforms the previous best method ZegFormer-seg [20] by 3.5% HM-IoU and 3.4% unseen-IoU, demonstrating its effectiveness. Note that the above methods use ResNet-101 while we only use ResNet-50.

4.5. Comparison with State-of-the-art ZSI Methods

We compare the proposed method with the previous state-of-the-art method ZSI [68] under the Generalized Zero-Shot Instance Segmentation (GZSI) setting in Tab. 5. Our model is trained with instance segmentation annotations for a fair comparison. We achieve new state-of-the-art performance on both the 48/17 split and the 65/15 split. For example, we surpass ZSI by 7.20% HM-mAP and 5.27% HM-Recall on the 48/17 split. Note that ZSI [68] uses ResNet-101 while we use ResNet-50.
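The HM-prefixed metrics reported in these comparisons combine seen and unseen scores. The sketch below assumes the convention common in generalized zero-shot evaluation, where HM is the harmonic mean of the seen and unseen values; the function name `harmonic_mean` and the numbers are illustrative and are not results from the tables.

```python
def harmonic_mean(seen, unseen):
    """Harmonic mean of a seen-class score and an unseen-class score."""
    if seen + unseen == 0:
        return 0.0
    return 2.0 * seen * unseen / (seen + unseen)

# Illustrative scores only (not taken from any table in this paper).
hm_balanced = harmonic_mean(40.0, 40.0)  # equal seen and unseen performance
hm_biased = harmonic_mean(60.0, 0.0)     # strong on seen, fails on unseen
hm_mixed = harmonic_mean(60.0, 20.0)     # moderate bias towards seen
```

The harmonic mean collapses to zero whenever either group scores zero, so a model that is biased towards seen classes cannot obtain a high HM from seen-class accuracy alone.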

4.6. Qualitative Results

To qualitatively demonstrate the effectiveness of the proposed approach, we visualize some examples of zero-shot panoptic segmentation results in Fig. 6. The second row shows the ground-truth masks, while the third and fourth rows show the masks predicted by the baseline and by our approach, respectively. We observe that our PADing successfully finds several unseen classes, e.g., suitcase, grass, frisbee, road, tree, and skateboard, that are missed or misclassified by the baseline model. Moreover, thanks to the class-agnostic mask generation ability of Mask2Former [11], our results contain high-quality masks.


Figure 6. Qualitative results on COCO for ZSP. The first row presents input images, and the subsequent rows show ground-truth masks, predictions of the baseline, and predictions of our PADing.

5. Conclusion

We propose primitive generation with collaborative relationship alignment and feature disentanglement learning (PADing) as a unified framework to achieve universal zero-shot segmentation. A primitive generator is proposed to synthesize fake training features for unseen classes. A collaborative feature disentanglement and relationship alignment learning strategy is proposed to help the generator produce better fake unseen features: the former decouples visual features into a semantic-related part and a semantic-unrelated part, and the latter transfers inter-class knowledge from semantic space to visual space. Extensive experiments on three zero-shot segmentation tasks demonstrate the effectiveness of the proposed approach.



