
Abstract

The problem of tracking multiple objects in a video sequence poses several challenging tasks. For tracking-by-detection, these include object re-identification, motion prediction and dealing with occlusions. We present a tracker (without bells and whistles) that accomplishes tracking without specifically targeting any of these tasks, in particular, we perform no training or optimization on tracking data. To this end, we exploit the bounding box regression of an object detector to predict the position of an object in the next frame, thereby converting a detector into a Tracktor. We demonstrate the potential of Tracktor and provide a new state-of-the-art on three multi-object tracking benchmarks by extending it with a straightforward re-identification and camera motion compensation.
We then perform an analysis on the performance and failure cases of several state-of-the-art tracking methods in comparison to our Tracktor. Surprisingly, none of the dedicated tracking methods are considerably better in dealing with complex tracking scenarios, namely, small and occluded objects or missing detections. However, our approach tackles most of the easy tracking scenarios. Therefore, we motivate our approach as a new tracking paradigm and point out promising future research directions. Overall, Tracktor yields superior tracking performance than any current tracking method and our analysis exposes remaining and unsolved tracking challenges to inspire future research directions.

Scene understanding from video remains one of the big challenges of computer vision. Humans are often the center of attention in a scene, which leads to the fundamental problem of detecting and tracking them in a video. Tracking-by-detection has emerged as the preferred paradigm to solve the problem of tracking multiple objects as it simplifies the task by breaking it into two steps: (i) detecting object locations independently in each frame, (ii) forming tracks by linking corresponding detections across time. The linking step, or data association, is a challenging task on its own, due to missing and spurious detections, occlusions, and target interactions in crowded environments. To address these issues, research in this area has produced increasingly complex models achieving only marginally better results, e.g., multiple object tracking accuracy has only improved 2.4% in the last two years on the MOT16 [45] benchmark.

In this paper, we push tracking-by-detection to the limit by using only an object detection method to perform tracking. We show that one can achieve state-of-the-art tracking results by training a neural network only on the task of detection. As indicated by the blue arrows in Figure 1, the regressor of an object detector such as Faster-RCNN [52] is sufficient to construct object trajectories in a multitude of challenging tracking scenarios. This raises an interesting question that we discuss in this paper: If a detector can solve most of the tracking problems, what are the real situations where a dedicated tracking algorithm is necessary? We hope our work and the presented Tracktor allow researchers to focus on the still unsolved critical challenges of multi-object tracking.

Figure 1. Tracktor performs multi-object tracking with nothing but an object detector, using two main processing steps, shown in blue and red. For a given frame $t$, the detector's bounding box regression first moves the boxes $\mathcal{B}^k_{t-1}$ of tracks already active in frame $t-1$ to their new positions in frame $t$. The classification scores $s^k_t$ of the regressed boxes are used to deactivate tracks that are likely occluded. Second, the object detector (or a given set of public detections) provides the detections $\mathcal{D}_t$ for frame $t$. Finally, a detection is considered to start a new track only if its IoU with every already tracked box is sufficiently small.

This paper presents four main contributions:

  • We introduce the Tracktor which tackles multi-object tracking by exploiting the regression head of a detector to perform temporal realignment of object bounding boxes.
  • We present two simple extensions to Tracktor, a re-identification Siamese network and a motion model. The resulting tracker yields state-of-the-art performance in three challenging multi-object tracking benchmarks.
  • We conduct a detailed analysis on failure cases and challenging tracking scenarios, and show none of the dedicated tracking methods perform substantially better than our regression approach.
  • We propose our method as a new tracking paradigm which exploits the detector and allows researchers to focus on the remaining complex tracking challenges. This includes an extensive study on promising future research directions.

Related Work

Several computer vision tasks such as surveillance, activity recognition or autonomous driving rely on object trajectories as input. Despite the vast literature on multi-object tracking [42, 38], it still remains a challenging problem, especially in crowded environments where occlusions and false detections are common. Most state-of-the-art works follow the tracking-by-detection paradigm which heavily relies on the performance of the underlying detection method.

Recently, neural network based detectors have clearly outperformed all other methods for detection [33, 52, 50]. The family of detectors that evolved to Faster-RCNN [52], and further detectors such as SDP [63], rely on object proposals which are passed to an object classification and a bounding box regression head of a neural network. The latter refines bounding boxes to fit tightly around the object. In this paper, we show that one can rethink the use of this regressor for tracking purposes.

Tracking as a graph problem. The data association problem deals with keeping the identity of the tracked objects given the available detections. This can be done on a frame by frame basis for online applications [5, 15, 48] or track-by-track [3]. Since video analysis can be done offline, batch methods are preferred since they are more robust to occlusions. A common formalism is to represent the problem as a graph, where each detection is a node, and edges indicate a possible link. The data association can then be formulated as maximum flow [4] or, equivalently, minimum cost problem with either fixed costs based on distance [26, 49, 66], including motion models [39], or learned costs [36]. Alternative formulations typically lead to more involved optimization problems, including minimum cliques [65], general-purpose solvers like MCMC [64] or multi-cuts [59]. A recent trend is to design ever more complex models which include other vision input such as reconstruction for multi-camera sequences [40, 60], activity recognition [12], segmentation [46], keypoint trajectories [10] or joint detection [59]. In general, the significantly higher computational costs do not translate to significantly higher accuracy. In fact, in this work, we show that we can outperform all graph-based trackers significantly while keeping the tracker online. Even within a graphical model optimization, one needs to define a measure to identify whether two bounding boxes belong to the same person or not. This can be done by analyzing either the appearance of the pedestrian, or its motion.

Appearance models and re-identification. Discriminating and re-identifying (reID) objects by appearance is in particular a problem in crowded scenes with many object-object occlusions. In the exhaustive literature that uses appearance models or reID methods to improve multi-object tracking, color-based models are very common [31]. However, these are not always reliable for pedestrian tracking, since people can wear very similar clothes, and color statistics are often contaminated by background pixels and illumination changes. The authors of [34] borrow ideas from person re-identification and adapt them to “re-identify” targets during tracking. In [62], a CRF model is learned to better distinguish pedestrians with similar appearance. Both appearance and short-term motion in the form of optical flow can be used as input to a Siamese neural network to decide whether two boxes belong to the same track or not [35]. Recently, [54] showed the importance of learned reID features for multi-object tracking. We confirm this view in our experiments.

Motion models and trajectory prediction. Several works resort to motion to discriminate between pedestrians, especially in highly crowded scenes. The most common assumption is the one of constant velocity (CVA) [11, 2], but pedestrian motion gets more complex in crowded scenarios for which researchers have turned to the more expressive Social Force Model [57, 48, 61, 39]. Such a model can also be learned from data [36]. Deep Learning has been extensively used to learn social etiquette in crowded scenarios for trajectory prediction [39, 1, 55]. [67] use single object tracking trained networks to create tracklets for further postprocessing into trajectories. Recently, [7, 51] proposed to use reinforcement learning to predict the position of an object in the next frame. While [7] focuses on single object tracking, the authors of [51] train a multi-object pedestrian tracker composed of a bounding box predictor and a decision network for collaborative decision making between tracked objects.

Video object detection. Multi-object tracking without frame-to-frame identity prediction is a subproblem usually referred to as video object detection. In order to improve detections, many methods exploit spatio-temporal consistencies of object positions. Both [28] and [27] generate multi-frame bounding box tuplet proposals and extract detection scores and features with a CNN and LSTM, respectively. Recently, the authors of [47] improve object detections by applying optical flow to propagate scores between frames. Eventually, [18] proposes to solve the tracking and detection problem jointly. They propose a network which processes two consecutive frames and exploits tracking ground truth data to improve detection regression, thereby generating two-frame tracklets. With a subsequent offline method, these tracklets are combined to multi-frame tracks. However, we show that our regression tracker is not only online, but superior in dealing with object occlusions. In particular, we do not only temporally align detections, but preserve their identity.

A Detector Is All You Need

We propose to convert a detector into a Tracktor performing multiple object tracking. Several CNN-based detection algorithms [52, 63] contain some form of bounding box refinement through regression. We propose an exploitation of such a regressor for the task of tracking. This has two key advantages: (i) we do not require any tracking specific training, and (ii) we do not perform any complex optimization at test time, hence our tracker is online. Furthermore, we show that our method achieves state-of-the-art performance on several challenging tracking scenarios.

Object detector

The core element of our tracking pipeline is a regression-based detector. In our case, we train a Faster R-CNN [52] with ResNet-101 [22] and Feature Pyramid Networks (FPN) [41] on the MOT17Det [45] pedestrian detection dataset.

To perform object detection, Faster R-CNN applies a Region Proposal Network to generate a multitude of bounding box proposals for each potential object. Feature maps for each proposal are extracted via Region of Interest (RoI) pooling [21], and passed to the classification and regression heads. The classification head assigns an object score to the proposal, in our case, it evaluates the likelihood of the proposal showing a pedestrian. The regression head refines the bounding box location tightly around an object. The detector yields the final set of object detections by applying non-maximum-suppression (NMS) to the refined bounding box proposals. Our presented method exploits the aforementioned ability to regress and classify bounding boxes to perform multi-object tracking.

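To make the detection step concrete, here is a minimal inference sketch using torchvision's off-the-shelf Faster R-CNN; the ResNet-50-FPN weights and the score threshold are stand-ins for the paper's ResNet-101-FPN model trained on MOT17Det.

```python
import torch
import torchvision

# Off-the-shelf Faster R-CNN with an FPN backbone. The paper trains a
# ResNet-101 + FPN variant on MOT17Det; this COCO-pretrained model is a stand-in.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_pedestrians(frame, score_thresh=0.5):
    """Run the detector on one frame (a CHW float tensor scaled to [0, 1])."""
    with torch.no_grad():
        # Internally: RPN proposals -> RoI pooling -> classification and
        # regression heads -> per-class NMS on the refined boxes.
        out = model([frame])[0]
    keep = (out["labels"] == 1) & (out["scores"] >= score_thresh)  # COCO class 1 = person
    return out["boxes"][keep], out["scores"][keep]  # boxes in (x1, y1, x2, y2) pixels
```

Note that torchvision returns boxes as corner coordinates, whereas the paper's notation below uses $(x, y, w, h)$; the conversion is straightforward.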

Tracktor

The challenge of multi-object tracking is to extract the spatial and temporal positions, i.e., trajectories, of $k$ objects given a frame by frame video sequence. Such a trajectory is defined as a list of ordered object bounding boxes $\mathcal{T}_k = \{\mathbf{b}^k_{t_1}, \mathbf{b}^k_{t_2}, \dots\}$, where a bounding box is defined by its coordinates $\mathbf{b}^k_t = (x, y, w, h)$, and $t$ represents a frame of the video. We denote the set of object bounding boxes in frame $t$ with $\mathcal{B}_t = \{\mathbf{b}^{k_1}_t, \mathbf{b}^{k_2}_t, \dots\}$. Note that each $\mathcal{T}_k$ or $\mathcal{B}_t$ can contain fewer elements than the total number of frames or trajectories in a sequence, respectively. At $t = 0$, our tracker initializes tracks from the first set of detections $\mathcal{D}_0 = \{\mathbf{d}^1_0, \mathbf{d}^2_0, \dots\} = \mathcal{B}_0$. In Figure 1, we illustrate the two subsequent processing steps (the nuts and bolts of our method) for a given frame $t$ with $t > 0$, namely, the bounding box regression and track initialization.

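As a minimal illustration of this bookkeeping (names and structure are our own, not the paper's), a trajectory $\mathcal{T}_k$ can be stored as a map from frame index to box, initialized from the first detections $\mathcal{D}_0$:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: dict = field(default_factory=dict)   # frame index t -> box (x, y, w, h)
    active: bool = True

def initialize_tracks(detections_0):
    """Start one track per detection in the first frame (t = 0)."""
    return [Track(track_id=k, boxes={0: box}) for k, box in enumerate(detections_0)]
```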

Bounding box regression. The first step, denoted with blue arrows, exploits the bounding box regression to extend active trajectories to the current frame $t$. This is achieved by regressing the bounding box $\mathbf{b}^k_{t-1}$ of frame $t-1$ to the object's new position $\mathbf{b}^k_t$ at frame $t$. In the case of Faster R-CNN, this corresponds to applying RoI pooling on the features of the current frame but with the previous bounding box coordinates. Our assumption is that the target has moved only slightly between frames, which is usually ensured by high frame rates (see Section B.5 of the supplementary for a frame rate robustness evaluation of Tracktor). The identity is automatically transferred from the previous to the regressed bounding box, effectively creating a trajectory. This is repeated for all subsequent frames.
After the bounding box regression, our tracker considers two cases for killing (deactivating) a trajectory: (i) an object leaving the frame or occluded by a non-object is killed if its new classification score $s^k_t$ is below $\sigma_{active}$, and (ii) occlusions between objects are handled by applying non-maximum suppression (NMS) to all remaining $\mathcal{B}_t$ and their corresponding scores with an Intersection over Union (IoU) threshold $\lambda_{active}$.

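A sketch of this per-frame regression and deactivation step. The function `regress_and_score` is an assumed detector interface (RoI pooling of the previous boxes on the current frame's features, followed by the regression and classification heads), and the two threshold values are placeholders; only the names $\sigma_{active}$ and $\lambda_{active}$ come from the paper.

```python
import torch
from torchvision.ops import nms

SIGMA_ACTIVE = 0.5    # classification score threshold sigma_active (assumed value)
LAMBDA_ACTIVE = 0.6   # IoU threshold lambda_active for inter-object NMS (assumed value)

def regression_step(frame, prev_boxes, regress_and_score):
    """Extend tracks from frame t-1 to frame t.

    prev_boxes: (N, 4) tensor of active boxes b_{t-1} in (x1, y1, x2, y2) format.
    regress_and_score: assumed detector interface that RoI-pools prev_boxes on the
        current frame's features and returns (regressed_boxes, classification_scores).
    Returns the indices of tracks kept alive and their regressed boxes at frame t.
    """
    new_boxes, scores = regress_and_score(frame, prev_boxes)

    # (i) deactivate tracks whose new classification score drops below sigma_active
    alive = scores >= SIGMA_ACTIVE
    idx = torch.nonzero(alive, as_tuple=False).squeeze(1)

    # (ii) NMS among the surviving boxes resolves object-object occlusions
    keep = nms(new_boxes[idx], scores[idx], LAMBDA_ACTIVE)
    idx = idx[keep]
    return idx, new_boxes[idx]
```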

Bounding box initialization. In order to account for new targets, the object detector also provides the detections $\mathcal{D}_t$ for the entire frame $t$. This second step, indicated in Figure 1 with red arrows, is analogous to the first initialization at $t = 0$. But a detection from $\mathcal{D}_t$ starts a trajectory only if the IoU with any of the already active trajectories $\mathbf{b}^k_t$ is smaller than $\lambda_{new}$. That is, we consider a detection for a new trajectory only if it is covering a potentially new object that is not explained by any trajectory. It should be noted again that our Tracktor does not require any tracking specific training or optimization and solely relies on an object detection method. This allows us to directly benefit from improved object detection methods and, most importantly, enables a comparatively cheap transfer to different tracking datasets or scenarios in which no ground truth tracking but only detection data is available.

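The initialization gate of this second step can be written with torchvision's pairwise IoU; the value of $\lambda_{new}$ below is an assumed placeholder.

```python
import torch
from torchvision.ops import box_iou

LAMBDA_NEW = 0.3  # IoU threshold lambda_new for starting new tracks (assumed value)

def new_track_candidates(detections, active_boxes):
    """Return detections whose IoU with every active track box stays below lambda_new.

    detections, active_boxes: (M, 4) and (N, 4) tensors in (x1, y1, x2, y2) format.
    """
    if active_boxes.numel() == 0:
        return detections
    ious = box_iou(detections, active_boxes)            # (M, N) pairwise IoU matrix
    unexplained = ious.max(dim=1).values < LAMBDA_NEW   # not covered by any trajectory
    return detections[unexplained]
```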

Tracking extensions

In this section, we present two straightforward extensions to our vanilla Tracktor: a motion model and a re-identification algorithm. Both are aimed at improving identity preservation across frames and are common examples of techniques used to enhance, e.g., graph-based tracking methods [39, 62, 35].

Motion model. Our previous assumption that the position of an object changes only slightly from frame to frame does not hold in two scenarios: large camera motion and low video frame rates. In extreme cases, the bounding boxes from frame $t-1$ might not contain the tracked object in frame $t$ at all. Therefore, we apply two types of motion models that will improve the bounding box position in future frames. For sequences with a moving camera, we apply a straightforward camera motion compensation (CMC) by aligning frames via image registration using the Enhanced Correlation Coefficient (ECC) maximization as introduced in [16]. For sequences with comparatively low frame rates, we apply a constant velocity assumption (CVA) for all objects as in [11, 2].

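A minimal sketch of the ECC-based camera motion compensation, assuming OpenCV 4.x; the warp direction and parameter choices reflect common CMC implementations rather than exact values from the paper.

```python
import cv2
import numpy as np

def camera_motion_compensation(prev_gray, curr_gray, boxes):
    """Warp boxes from frame t-1 into frame t via ECC image alignment.

    prev_gray, curr_gray: uint8 grayscale frames.
    boxes: (N, 4) array of (x1, y1, x2, y2) boxes in frame t-1 coordinates.
    """
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-5)
    # ECC maximization estimates a Euclidean warp between the two frames.
    _, warp = cv2.findTransformECC(prev_gray, curr_gray, warp,
                                   cv2.MOTION_EUCLIDEAN, criteria, None, 5)
    warped = boxes.astype(np.float32).copy()
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        p1 = warp @ np.array([x1, y1, 1.0], dtype=np.float32)
        p2 = warp @ np.array([x2, y2, 1.0], dtype=np.float32)
        warped[i] = [p1[0], p1[1], p2[0], p2[1]]
    return warped
```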

Re-identification. In order to keep our tracker online, we suggest a short-term re-identification (reID) based on appearance vectors generated by a Siamese neural network [6, 25, 54]. To that end, we store killed (deactivated) tracks in their non-regressed version $\mathbf{b}^k_{t-1}$ for a fixed number of $F_{reID}$ frames. We then compare the distance in the embedding space of the deactivated with the newly detected tracks and re-identify via a threshold. The embedding space distance is computed by a Siamese CNN and appearance feature vectors for each of the bounding boxes. It should be noted that the reID network is indeed trained on tracking ground truth data. To minimize the risk of false reIDs, we only consider pairs of deactivated and new bounding boxes with a sufficiently large IoU. The motion model is continuously applied to the deactivated tracks.

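A sketch of the short-term reID matching; `embed` stands for the assumed Siamese appearance network mapping an image crop to a feature vector, `dead_embeddings` are the stored vectors of deactivated tracks, and both thresholds are placeholder values.

```python
import torch
from torchvision.ops import box_iou

REID_DIST_THRESH = 2.0   # maximum embedding distance for a re-identification (assumed)
REID_IOU_GATE = 0.2      # minimum IoU between old and new box to consider a match (assumed)

def reidentify(new_boxes, new_crops, dead_boxes, dead_embeddings, embed):
    """Greedily match new detections to recently deactivated tracks."""
    matches = []
    if len(dead_boxes) == 0 or len(new_boxes) == 0:
        return matches
    new_emb = torch.stack([embed(crop) for crop in new_crops])   # (M, d)
    dist = torch.cdist(new_emb, dead_embeddings)                 # (M, K) pairwise distances
    iou = box_iou(new_boxes, dead_boxes)                         # (M, K) spatial gate
    dist[iou < REID_IOU_GATE] = float("inf")
    for m in range(dist.shape[0]):
        k = int(torch.argmin(dist[m]))
        if dist[m, k] <= REID_DIST_THRESH:
            matches.append((m, k))        # detection m resumes deactivated track k
            dist[:, k] = float("inf")     # each dead track is revived at most once
    return matches
```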

Experiments

Ablation study


Ablation study: remove individual components or features from a model or algorithm and observe how their removal affects the results. In other words, when several ideas are proposed at the same time to improve a model, an ablation study is the controlled-variable experiment that verifies each idea is effective on its own.


Benchmark evaluation

Analysis

The superior performance of our tracker without any tracking specific training or optimization demands a more thorough analysis. Without sophisticated tracking methods, it is not expected to excel in crowded and occluded, but rather only in benevolent, tracking scenarios. This begs the question of whether more common tracking methods fail to specifically address these complex scenarios as well. Our experiments and the subsequent analysis ought to demonstrate the strengths of our approach for easy tracking scenarios and motivate future research to focus on remaining complex tracking problems. In particular, we question the common execution of tracking-by-detection and suggest a new tracking paradigm. The subsequent analysis is conducted on the MOT17 training data and we compare all top performing methods with publicly shared data.

Tracking challenges

For a better understanding of our tracker, we want to analyse challenging tracking scenarios and compare its strengths and weaknesses to other trackers. To this end, we summarize their fundamental characteristics in Table 3.

Object visibility. Intuitively, we expect diminished tracking performance for object-object or object-non-object occlusions, i.e., for targets with diminished visibility. In Figure 2, we compare the ratio of successfully tracked bounding boxes with respect to their visibility. The transparent red bar indicates the occurrences of ground truth bounding boxes for each visibility, and illustrates the proportionate impact on the overall performance of the trackers. Our method achieves superior performance even for partially occluded bounding boxes with visibilities as low as 0.3. Neither the identity preserving aspects of MHT DAM and MOTDT17 [9] nor the offline interpolation capabilities of MHT DAM and jCC seem to successfully tackle highly occluded objects. The high MOTA values in Table 2 are largely due to the unbalanced distribution of ground truth visibilities. As expected, our extended version only achieves minor improvements over our vanilla Tracktor.

Object size. In view of the large fraction of visible but not tracked objects in Figure 2, we argue that the trackability of an object is not only dependent on its visibility, but also its size. Therefore, we conduct the same comparison as for the visibility but for the size of an object. In the first row of Figure 3, we assume the height of a pedestrian to be proportional to its size and compare on all three MOT17 public detection sets. All methods performed similarly well for object heights larger than 250 pixels. To demonstrate their shortcomings even for highly visible objects, we only compare objects with a visibility larger than 0.9. As expected, the trackability of an object decreases drastically with its size across all three detection sets. Our tracker shows its strength in compensating for insufficient DPM and Faster R-CNN detections for all object sizes. All methods except MOTDT17 benefit from the additional small detections provided by SDP. For our tracker this is largely due to the Feature Pyramid Network extension of our Faster-RCNN detector. However, the learned appearance model and reID of the online MOTDT17 method seem generally vulnerable to small detections. Appearance models generally suffer from small object sizes and few observed pixels. In conclusion, except from our compensation of inferior detections none of the trackers exhibit a notably better performance with respect to varying object sizes.

Robustness to detections. The performance of tracking-by-detection methods with respect to visibility and size is inherently limited by the robustness of the underlying detection method. However, as observed for the object size, trackers differ in their ability to cope with, or benefit from, varying quality of detections. In the second row of Figure 3, we quantify this ability in terms of detection gaps and their coverage by the tracker. We define a detection gap as part of a ground truth trajectory that was at least once detected, and compare coverage of each gap vs. the gap length. Intuitively, long gaps are harder to compensate for, as the online or offline tracker has to perform a longer hallucination or interpolation, respectively. We indicated the occurrences of gap lengths over the respective set of detections in transparent red. For DPM and Faster R-CNN detections, two solutions lead to notable gap coverage: (i) offline interpolation such as in jCC, or (ii) motion prediction with Kalman filter and reID as in MOTDT. Compared to the graph-based jCC method, the online MOTDT17 method excels at covering particularly long gaps. However, none of these dedicated tracking methods yields similar robustness to our frame by frame regression tracker, which achieves far superior coverage. This holds especially true for long detection gaps with more than 15 frames. Offline methods benefit the most from improved SDP detections and neither our nor the MOTDT17 tracker convinces with a notable gap length robustness.

Identity preservation. The results of our Tracktor++ summarized in Table 2 indicate an identity preservation performance in terms of IDF1 and identity switches comparable with dedicated tracking methods. This is achieved without any offline graph optimization as in jCC [30] or eHAF [58]. In particular, MOTDT17, which applies a sophisticated appearance model and reID, is not substantially superior to our regression tracker and its comparatively simple extensions. However, our method excels in reducing the number of false positives in MOT17 as well as MOT16. In addition, we have shown that our Tracktor is capable of incorporating additional identity preserving extensions.

What an ideal tracker would look like (Oracle Tracktors)

We have shown that none of the dedicated tracking methods specifically targets challenging tracking scenarios, i.e., objects under heavy occlusions or small objects. We therefore want to motivate our Tracktor as a new tracking paradigm. To this end, we analyse our performance twofold: (i) the impact of the object detector on the killing policy and bounding box regression, (ii) identify performance upper bounds for potential extensions to our Tracktor. In Table 4, we present several oracle trackers by replacing parts of our algorithm with ground truth information. If not mentioned otherwise, all other tracking aspects are handled by our vanilla Tracktor. Their analysis should provide researchers with useful insights regarding the most promising research directions and extensions of our Tracktor.

Detector oracles. To simulate a potentially perfect object detector, we introduce two oracles:

  • Oracle-Kill: Instead of killing with NMS or classification score we use ground truth information.
  • Oracle-REG: Instead of regression, we place the bounding boxes at their ground truth position.

Both oracles yield substantial improvements with respect to MOTA and FP. However, killing by ground truth instead of score deteriorates identity preservation as the regression struggles with otherwise unseen bounding boxes.

Extension oracles. It should be noted that Tracktor++ with non-perfect extensions already compensates for some of the detector's insufficiencies. The reID and motion model (MM) oracles simulate potential additional performance gains. In order to remain online, these exclude any form of hindsight tracking-gap interpolation.

  • Oracle-MM: A motion model places each bounding box at the center of the ground truth in the next frame.
  • Oracle-reID: Re-identification is performed with ground truth identities.

As expected, both oracles improve IDF1 and identity switches substantially. The combined Oracle-MM-reID represents the extension upper bound of Tracktor++.

Omniscient oracle. Oracle-ALL performs ground truth killing, regression and reID. We consider its top MOTA of 72.2%, in combination with a high IDF1 and virtually no false positives, as the absolute upper bound of Tracktor with a Faster R-CNN and FPN object detector.

The substantial performance gains from Oracle-MM indicate the potential of extending Tracktor with a sophisticated motion model. In particular, Oracle-MM-reIDINTER suggests a predictive motion model which hallucinates the position of an object through long occlusions. Such a motion model avoids offline post processing and additional false positives from wrong linear occlusion paths caused by long detection gaps and camera movement.

Towards a new tracking paradigm

To conclude our analysis we propose two approaches on how to utilize Tracktor as a starting point for future research directions:

Tracktor with extensions. Apply Tracktor to a given set of detections and extend it with tracking specific methods. Scenarios with large and highly visible objects will be covered by the frame to frame bounding box regression. For the remaining, it seems most promising to implement a hallucinating motion model, taking into account the individual movements of objects. In addition, such a motion predictor reduces the necessity for an advanced killing policy.

Tracklet generation. Analogous to tracking-by-detection, we propose a tracking-by-tracklet approach. Indeed, many algorithms already use tracklets as input [24, 65], as they are richer in information for computing motion or appearance models. However, usually a specific tracking method is used to create these tracklets. We advocate the exploitation of the detector itself, not only to create sparse detections, but frame to frame tracklets. The remaining complex tracking cases ought to be tackled by a subsequent tracking method.

In this work, we have formally defined those hard cases, analyzing the situations in which not only our method but other dedicated tracking solutions fail. And by doing so, we question the current focus of research in multi-object tracking, in particular, the missing confrontation with challenging tracking scenarios.

Conclusion

We have shown that the bounding box regressor of a trained Faster-RCNN detector is enough to solve most tracking scenarios present in current benchmarks. A detector converted to Tracktor needs no specific training on tracking ground truth data and is able to work in an online fashion. In addition, we have shown that our Tracktor is extendable with re-identification and camera motion compensation, providing a substantial new state-of-the-art on the MOTChallenge. We analyzed the performance of multiple dedicated tracking methods on challenging tracking scenarios and none yielded substantially better performance compared to our regression based Tracktor. We hope this work establishes a new tracking paradigm, utilizing the object detector’s full capabilities.
