Abstract

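We propose the Modular interactive VOS (MiVOS) framework, which decouples interaction-to-mask and mask propagation, allowing for higher generalizability and better performance. The separately trained interaction module converts user interactions into an object mask, which is then temporally propagated by our propagation module using a novel top-k filtering strategy when reading the space-time memory. To effectively take the user's intent into account, a novel difference-aware module is proposed to learn how to properly fuse the masks before and after each interaction, which are aligned with the target frames by employing the space-time memory. We evaluate our method both qualitatively and quantitatively with different forms of user interaction (e.g., scribbles, clicks) on DAVIS to show that our method outperforms current state-of-the-art algorithms while requiring fewer frame interactions, with the additional advantage of generalizing to different types of user interaction. We contribute a large-scale synthetic VOS dataset with pixel-accurate segmentation of 4.8M frames to accompany our source code and facilitate future research.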

Introduction

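Interactive VOS (iVOS)

Characteristics: interactive VOS methods take user interactions (e.g., scribbles or clicks) as input, and the user can iteratively refine the result until satisfied.

It comprises two tasks: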

  • interaction understanding
  • temporal propagation

Existing Problem

(1) Strong coupling of the two tasks limits the form of user interaction (e.g., scribbles only) and makes training difficult. Attempts to decouple the two tasks fail to reach state-of-the-art accuracy because the user's intent cannot be adequately taken into account in the propagation process.

(2) Naive decoupling may lose the user's intent, because the original interaction is no longer available in the propagation stage.

Solution

We present a decoupled modular framework to address the iVOS problem.

Contributions

  • We innovate on the decoupled interaction-propagation framework and show that this approach is simple, effective, and generalizable.
  • We propose a novel lightweight top-k filtering scheme for the attention-based memory read operation in mask generation during propagation.
  • We propose a novel difference-aware fusion module to faithfully capture the user's intent, which improves iVOS accuracy and reduces the amount of user interaction.
  • We contribute a large-scale synthetic VOS dataset with 4.8M frames to accompany our source code and facilitate future research.

Related Work

Progress in iVOS is shown below:

Semi-Supervised Video Object Segmentation

Definition: segment a specific object throughout a video, given only a fully annotated mask in the first frame.

Interactive Video Object Segmentation (iVOS)

focus:

(1)scribble interaction

(2)click interaction

Interactive Image Segmentation

Method

Initial Work

Initially, the user selects and interactively annotates one frame (e.g., using scribbles or clicks) to produce a mask.

MiNet Overview

Notation

(1) $r$ denotes the current interaction round

(2) $t_r$ is the index of the frame the user interacts with in the $r$-th round

(3) $M^r$ denotes the mask results of the $r$-th round

(4) $M^r_j$ denotes the mask of the individual $j$-th frame in the $r$-th round

Core Component

Interaction-to-mask: lets the user obtain real-time feedback and reach a satisfactory result on a single frame

Mask propagation: the corrected mask is propagated bidirectionally

Difference-aware fusion: uses the two mask sequences (before and after the interaction) while avoiding possible decay or loss of the user's intent

How the user's intent is captured: by the difference in the selected mask before and after the user interaction
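To make the division of labor concrete, one interaction round can be sketched as below. This is only an illustrative outline, not the authors' implementation: `interact_fn`, `propagate_fn`, and `fuse_fn` are hypothetical placeholders for the three modules, and masks are represented by plain numbers to keep the sketch short.

```python
def run_round(prev_masks, t_r, interact_fn, propagate_fn, fuse_fn):
    """One interaction round over a list of per-frame masks (illustrative only).

    prev_masks   : masks from the previous round, one per frame
    t_r          : index of the frame the user interacts with
    interact_fn  : hypothetical interaction-to-mask module (old mask -> corrected mask)
    propagate_fn : hypothetical propagation module (mask, frame index, n frames -> list of masks)
    fuse_fn      : hypothetical fusion module
                   (old mask, propagated mask, mask before, mask after) -> fused mask
    """
    before = prev_masks[t_r]
    after = interact_fn(before)                      # interaction-to-mask on one frame
    propagated = propagate_fn(after, t_r, len(prev_masks))  # temporal propagation
    fused = []
    for j, old in enumerate(prev_masks):
        if j == t_r:
            fused.append(after)                      # the interacted frame keeps the user's result
        else:
            # fusion sees the before/after pair, i.e. the user's correction
            fused.append(fuse_fn(old, propagated[j], before, after))
    return fused
```

The point of the structure is that the fusion step receives both mask sequences plus the before/after pair of the interacted frame, so the user's correction stays visible even though interaction and propagation are separate modules.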

Figure

Interaction-to-Mask

Scribble-to-Mask(S2M)

Goal: produce a single-image segmentation in real time given input scribbles

backbone: DeepLabV3+ semantic segmentation network

Local Control

Previous state-of-the-art approach: it may harm the global result when only a local fine adjustment is needed toward the end of the segmentation process.

Source of the previous state-of-the-art approach:

Konstantin Sofiiuk, Ilia Petrov, Olga Barinova, and Anton Konushin. f-BRS: Rethinking backpropagating refinement for interactive segmentation. In CVPR, 2020.

Our approach: it is straightforward to assert local control by limiting the interactive algorithm to a user-specified region.
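A minimal sketch of this kind of local control, assuming the region is an axis-aligned box and that the interactive algorithm is available as a callable (here the hypothetical `segment_fn`, which is not part of any real library): the algorithm only sees the crop, and its output is written back into the existing mask.

```python
import numpy as np

def refine_in_region(mask, image, box, segment_fn):
    """Apply an interactive segmentation step only inside a user-specified box.

    mask       : (H, W) current segmentation
    image      : (H, W, ...) input frame
    box        : (y0, y1, x0, x1), half-open pixel ranges (assumed box format)
    segment_fn : hypothetical callable (image_crop, mask_crop) -> new mask crop
    """
    y0, y1, x0, x1 = box
    out = mask.copy()  # pixels outside the box are left untouched
    out[y0:y1, x0:x1] = segment_fn(image[y0:y1, x0:x1], mask[y0:y1, x0:x1])
    return out
```

Because everything outside the box is copied unchanged, a late local fix cannot disturb parts of the mask that the user already accepted.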

Comparison of the two approaches above:

Temporal Propagation

Goal: tracks the object and produces corresponding masks in subsequent frames.

Memory Read with Top-k Filtering

(1) Compute the affinity

$F \in \mathbb{R}^{THW \times HW}$ represents the affinity between a query position and a memory position

(2) Filter the affinities such that only the top-k entries are kept

Effect: effectively removes noise regardless of the sequence length

Advantage: increases robustness and overcomes the overhead of top-k

(3) For query position $j$, the feature $m_j$ is read from memory by:

(4) Concatenate the read features with $v^Q$
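The four steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the affinity `F` has one row per memory position and one column per query position (matching the $THW \times HW$ shape), the memory values `v_M` are stored row-wise as `(THW, C)`, and the kept entries are normalized with a softmax per query position.

```python
import numpy as np

def topk_memory_read(F, v_M, k):
    """Read memory features using only the top-k affinities per query position.

    F   : (N_mem, N_query) affinity between memory and query positions
    v_M : (N_mem, C) memory value features (row-wise layout is an assumption)
    k   : number of memory entries kept per query position (k >= 1)
    Returns m : (N_query, C) read features and W : (N_mem, N_query) weights.
    """
    F = np.asarray(F, dtype=float)
    k = min(k, F.shape[0])
    # Indices of the k largest affinities in each column (one column per query).
    top_idx = np.argpartition(-F, k - 1, axis=0)[:k]
    top_val = np.take_along_axis(F, top_idx, axis=0)
    # Softmax over the kept entries only; all other weights stay exactly zero.
    top_val = top_val - top_val.max(axis=0, keepdims=True)
    w = np.exp(top_val)
    w /= w.sum(axis=0, keepdims=True)
    W = np.zeros_like(F)
    np.put_along_axis(W, top_idx, w, axis=0)
    # Weighted sum of memory values for every query position.
    m = W.T @ v_M
    return m, W

# Toy usage: 6 memory positions, 4 query positions, 8 channels.
rng = np.random.default_rng(0)
F = rng.normal(size=(6, 4))
v_M = rng.normal(size=(6, 8))
v_Q = rng.normal(size=(4, 8))
m, W = topk_memory_read(F, v_M, k=3)
features = np.concatenate([m, v_Q], axis=1)  # step (4): concatenate with v^Q
```

In this sketch the exponentials are computed on only k entries per query position instead of on all memory positions.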

the process is shown below:

Propagation strategy

our propagation scheme:

Difference-Aware Fusion

(1) Compute the positive and negative changes separately as two masks $D^+$ and $D^-$

Note: $(\cdot)_+$ denotes $\max(\cdot, 0)$

(2) Compute the aligned masks

Note: $W$ comes from step (2) of Memory Read with Top-k Filtering

(3) Feed these features into a simple five-layer residual network, terminated by a sigmoid, to output the final fused mask
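Steps (1) and (2) can be illustrated with a short NumPy sketch. It rests on assumptions the notes do not spell out: the positive change is taken as "mask after interaction minus mask before", clipped at zero (and the negative change the other way round), and `W` is laid out with rows indexing positions of the interacted frame and columns indexing positions of the target frame. The fusion network of step (3) is not reproduced here.

```python
import numpy as np

def difference_masks(mask_before, mask_after):
    """Split the user's correction into added and removed regions.

    Assumes D+ = (after - before)_+ and D- = (before - after)_+.
    """
    d_pos = np.maximum(mask_after - mask_before, 0.0)  # pixels the user added
    d_neg = np.maximum(mask_before - mask_after, 0.0)  # pixels the user removed
    return d_pos, d_neg

def align_to_target(W, d):
    """Carry a difference mask from the interacted frame to a target frame.

    W : (HW, HW) attention weights; rows = interacted-frame positions,
        columns = target-frame positions (layout is an assumption)
    d : (H, W) difference mask on the interacted frame
    """
    h, w = d.shape
    return (W.T @ d.reshape(-1)).reshape(h, w)

# Toy usage on a 2x2 frame with an identity alignment.
before = np.array([[1.0, 0.0], [1.0, 0.0]])
after = np.array([[1.0, 1.0], [0.0, 0.0]])
d_pos, d_neg = difference_masks(before, after)
a_pos = align_to_target(np.eye(4), d_pos)
a_neg = align_to_target(np.eye(4), d_neg)
```

Keeping the added and removed regions in separate masks lets the fusion network tell "the user added this region" apart from "the user removed this region" on the target frame.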

Mechanism of the difference-aware fusion module:


Experiment

Performance on the DAVIS interactive validation set:

Conclusion

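We propose MiVOS, a novel decoupled approach consisting of three modules: Interaction-to-Mask, Propagation, and Difference-Aware Fusion. By decoupling interaction from propagation, MiVOS is versatile and not limited by the type of interaction. On the other hand, the proposed fusion module reconciles interaction and propagation by faithfully capturing the user's intent, and it mitigates the information lost in the decoupling process, making MiVOS both accurate and efficient. We hope MiVOS can inspire and spark future research in iVOS.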
