


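We propose the Modular interactive VOS (MiVOS) framework, which decouples interaction-to-mask and mask propagation, allowing for higher generalizability and better performance. The separately trained interaction module converts user interactions into an object mask, which our propagation module then propagates through time, using a novel top-k filtering strategy when reading the space-time memory. To take the user's intent into account effectively, a novel difference-aware module learns how to properly fuse the masks before and after each interaction; these masks are aligned with the target frames by using the space-time memory. We evaluate our method qualitatively and quantitatively on DAVIS with different forms of user interaction (e.g., scribbles, clicks) and show that it outperforms current state-of-the-art algorithms while requiring fewer frame interactions, with the additional advantage of generalizing to different types of user interaction. We contribute a large-scale synthetic VOS dataset with 4.8M frames of pixel-accurate segmentation to accompany our source code and facilitate future research.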


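Interactive VOS (iVOS)

Characteristics: interactive VOS methods take user interactions (e.g., scribbles or clicks) as input, and the user can iteratively refine the result until satisfied.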


  • interaction understanding
  • temporal propagation

Existing Problem

(1) The strong coupling limits the form of user interaction (e.g., scribbles only) and makes training difficult. Attempts to decouple the two tasks fail to reach state-of-the-art accuracy because the user's intent cannot be adequately taken into account in the propagation process.

(2) Naive decoupling may lead to loss of the user's intent, as the original interaction is no longer available in the propagation stage.



We present a decoupled modular framework to address the iVOS problem.


  • We innovate on the decoupled interaction-propagation framework and show that this approach is simple, effective, and generalizable.
  • We propose a novel lightweight top-k filtering scheme for the attention-based memory read operation in mask generation during propagation.
  • We propose a novel difference-aware fusion module to faithfully capture the user's intent, which improves iVOS accuracy and reduces the amount of user interaction.
  • We contribute a large-scale synthetic VOS dataset with 4.8M frames to accompany our source code and facilitate future research.

Related Work

Progress in iVOS is shown below:

Semi-Supervised Video Object Segmentation

Definition: segment a specific object throughout a video given only a fully annotated mask in the first frame.

Interactive Video Object Segmentation (iVOS)


(1)scribble interaction

(2)click interaction

Interactive Image Segmentation


Initial Work

Initially, the user selects and interactively annotates one frame (e.g., using scribbles or clicks) to produce a mask.


MiVOS Overview

Notation

(1) We denote $r$ as the current interaction round.

(2) The user-interacted frame index in the $r$-th round is $t_r$.

(3) The mask results of the $r$-th round are $M^r$.

(4) The mask of the individual $j$-th frame is denoted as $M^r_j$.

Core Component

Interaction-to-mask: allows the user to obtain real-time feedback and achieve a satisfactory result on a single frame.

Mask propagation: the corrected mask is propagated bidirectionally.

Difference-aware fusion: uses the two sequences (the masks before and after the interaction) while avoiding possible decay or loss of the user's intent.

How to capture the user's intent: use the difference in the selected mask before and after user interaction.
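To make "bidirectionally propagated" concrete, here is a minimal sketch of the frame order such a propagation could follow. The function name `propagation_order` is hypothetical, and the sketch assumes propagation simply runs from the interacted frame $t_r$ to both ends of the video; any stopping rule used by the actual method is not modeled here.

```python
def propagation_order(t_r, num_frames):
    """Return the frame indices visited when propagating away from frame t_r.

    Assumption of this sketch: propagation runs to both ends of the video.
    """
    # Forward pass: frames after the interacted frame, in increasing order.
    forward = list(range(t_r + 1, num_frames))
    # Backward pass: frames before the interacted frame, in decreasing order.
    backward = list(range(t_r - 1, -1, -1))
    return forward, backward
```

Each pass starts from the corrected mask at $t_r$, so frames nearest the interaction are updated first.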




Interaction-to-Mask

Goal: produce a single-image segmentation in real time given input scribbles.

backbone: DeepLabV3+ semantic segmentation network

Local Control

Previous state-of-the-art approach: it may harm the global result when only a fine local adjustment is needed toward the end of the segmentation process.

Source of the previous state-of-the-art approach:

Our approach: it is straightforward to assert local control by restricting the interactive algorithm to a user-specified region.
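One way to read "limiting the interactive algorithm to apply in a user-specified region" is to run the segmentation only on a crop and paste the result back. The sketch below assumes a rectangular region and a hypothetical callable `segment_fn(image_crop, mask_crop)` standing in for the interaction-to-mask network; neither is taken from the paper.

```python
import numpy as np

def refine_in_region(image, mask, box, segment_fn):
    """Re-segment only inside `box`, leaving the rest of `mask` untouched.

    box: (top, left, bottom, right), half-open pixel bounds (assumed format).
    segment_fn: hypothetical callable returning a mask for the given crop.
    """
    top, left, bottom, right = box
    refined = mask.copy()  # pixels outside the box keep their old values
    refined[top:bottom, left:right] = segment_fn(
        image[top:bottom, left:right],
        mask[top:bottom, left:right],
    )
    return refined
```

Because everything outside the box is copied unchanged, a late local correction cannot degrade the rest of the mask.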

the comparison of above two approaches:

Temporal Propagation

Goal: track the object and produce the corresponding masks in subsequent frames.

Memory Read with Top-k Filtering


(1) $F \in \mathbb{R}^{THW \times HW}$ represents the affinity between a query position and a memory position.

(2) Filter the affinities so that only the top-k entries are kept.

Effect: effectively removes noise regardless of the sequence length.

Advantage: increases robustness and overcomes the overhead of top-k.

(3) For query position $j$, the feature $m_j$ is read from memory by:

(4) Concatenate the read features with $v^Q$.
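The four steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the dot-product affinity, the softmax over the kept entries, and all names (`topk_memory_read`, `k_mem`, `v_mem`, `k_qry`) are choices made for this sketch.

```python
import numpy as np

def topk_memory_read(k_mem, v_mem, k_qry, top_k):
    """Read memory values using top-k filtered attention weights.

    k_mem: (C_k, N_mem) memory keys, with N_mem = T*H*W
    v_mem: (C_v, N_mem) memory values
    k_qry: (C_k, N_qry) query keys, with N_qry = H*W
    Returns (read, W): read is (C_v, N_qry), W is (N_mem, N_qry).
    """
    # Step (1): affinity F[i, j] between memory position i and query
    # position j. Dot-product similarity is an assumption of this sketch.
    F = (k_mem.T @ k_qry).astype(float)
    top_k = min(top_k, F.shape[0])

    # Step (2): for every query position (column), keep the top-k affinities.
    idx = np.argpartition(F, -top_k, axis=0)[-top_k:]
    vals = np.take_along_axis(F, idx, axis=0)

    # Normalize with a softmax over the kept entries only (assumed here).
    vals = np.exp(vals - vals.max(axis=0, keepdims=True))
    vals /= vals.sum(axis=0, keepdims=True)

    # Scatter the weights back; every filtered-out entry stays zero.
    W = np.zeros_like(F)
    np.put_along_axis(W, idx, vals, axis=0)

    # Step (3): the feature for query position j is a weighted sum of
    # memory values using column j of W.
    read = v_mem @ W
    return read, W
```

Step (4) would then be `np.concatenate([read, v_qry], axis=0)`, where `v_qry` is a hypothetical query value array of shape (C_v, N_qry). Since each column of `W` has at most `top_k` nonzero entries, the number of memory positions contributing to a read does not grow with the number of memory frames.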

the process is shown below:

Propagation strategy

our propagation scheme:

Difference-Aware Fusion

(1) Compute the positive and negative changes separately as two masks $D^+$ and $D^-$.

Note: $(\cdot)_+$ denotes $\max(\cdot, 0)$.

(2) Compute the aligned masks.

Note: $W$ comes from step (2) of Memory Read with Top-k Filtering.

(3) Feed these features into a simple five-layer residual network, terminated by a sigmoid, to output the final fused mask.
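Steps (1) and (2) can be illustrated with a short NumPy sketch. Several points are assumptions made here rather than details stated in these notes: that $D^+$ is "after minus before" and $D^-$ the reverse, that `W[i, j]` weights interacted-frame position `i` for target-frame position `j`, and the function names themselves. The residual network of step (3) is not modeled.

```python
import numpy as np

def difference_masks(mask_before, mask_after):
    """Split the change at the interacted frame into positive and negative parts."""
    # (x)_+ = max(x, 0): keep only the pixels the interaction added ...
    d_pos = np.maximum(mask_after - mask_before, 0.0)
    # ... and, separately, only the pixels the interaction removed.
    d_neg = np.maximum(mask_before - mask_after, 0.0)
    return d_pos, d_neg

def align_to_target(diff, W):
    """Carry a difference mask from the interacted frame to a target frame.

    diff: (H, W_img) difference mask at the interacted frame
    W:    (H*W_img, H*W_img) attention weights; the orientation
          W[i, j] = weight of source position i for target position j
          is an assumption of this sketch.
    """
    h, w = diff.shape
    # Each target pixel receives a weighted sum of source differences.
    return (W.T @ diff.reshape(-1)).reshape(h, w)
```

Keeping the two signs apart lets the fusion network distinguish regions the user added from regions the user removed, which a single signed difference would blur after alignment.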

Mechanism of the difference-aware fusion module:



Performance on the DAVIS interactive validation set:


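We propose MiVOS, a novel decoupled approach consisting of three modules: Interaction-to-Mask, Propagation, and Difference-Aware Fusion. By decoupling interaction from propagation, MiVOS is versatile and not limited by the type of interaction. At the same time, the proposed fusion module reconciles interaction and propagation by faithfully capturing the user's intent and mitigating the information lost in the decoupling process, which makes MiVOS both accurate and efficient. We hope MiVOS can inspire and spark future research in iVOS.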

