DSST (Accurate Scale Estimation for Robust Visual Tracking): Code Walkthrough

Updated: 2024-10-24 23:16:39


Reposted:

Accurate Scale Estimation for Robust Visual Tracking

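In my earlier blog post, "Correlation Filter Tracking (MOSSE)", I covered the principle of correlation filter tracking. That paper did not come with code, however, so there was no way to study it in depth, and pure theory makes for dry reading. Martin Danelljan later improved on MOSSE and added multi-scale tracking. The improvement is significant: the method ranked first in that year's VOT challenge (VOT2014). The paper is titled Accurate Scale Estimation for Robust Visual Tracking and its code is called DSST, so I will refer to the method as DSST from here on. I have uploaded the paper and the code here (access password: af88) and also here, so you can download them. Now for the main text: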

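MOSSE (Visual Object Tracking using Adaptive Correlation Filters) solves for its filter using the image itself (the grayscale image) as input, that is, the grayscale feature of the image. Grayscale is a simple feature and does not describe the target's texture, edges, or other shape information well, so the DSST author bases the filter on HOG features, which are widely used in tracking and recognition.

The DSST author splits tracking into two parts: position change (translation) and scale change (scale estimation). The implementation defines two correlation filters: one (the translation filter) is dedicated to locating the target in the new frame, and the other (the scale filter) is dedicated to estimating its scale.

The translation filter works the same way as in MOSSE, except that the criterion for obtaining the optimal template H is slightly different. The translation filter gives the target position in the current frame; candidate boxes at different scales are then extracted at that position and passed through the scale filter to determine the new target scale.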

Implementation:

First, the pseudocode given by the author:

Algorithm 1 Proposed tracking approach: iteration at time step t.
Input:
  Image $I_t$.
  Previous target position $p_{t-1}$ and scale $s_{t-1}$.
  Translation model $A^{trans}_{t-1}$, $B^{trans}_{t-1}$ and scale model $A^{scale}_{t-1}$, $B^{scale}_{t-1}$.
Output:
  Estimated target position $p_t$ and scale $s_t$.
  Updated translation model $A^{trans}_t$, $B^{trans}_t$ and scale model $A^{scale}_t$, $B^{scale}_t$.
Translation estimation:
  1: Extract a translation sample $z_{trans}$ from $I_t$ at $p_{t-1}$ and $s_{t-1}$.
  2: Compute the translation correlation $y_{trans}$ using $z_{trans}$, $A^{trans}_{t-1}$ and $B^{trans}_{t-1}$ in (6).
  3: Set $p_t$ to the target position that maximizes $y_{trans}$.
Scale estimation:
  4: Extract a scale sample $z_{scale}$ from $I_t$ at $p_t$ and $s_{t-1}$.
  5: Compute the scale correlation $y_{scale}$ using $z_{scale}$, $A^{scale}_{t-1}$ and $B^{scale}_{t-1}$ in (6).
  6: Set $s_t$ to the target scale that maximizes $y_{scale}$.
Model update:
  7: Extract samples $f_{trans}$ and $f_{scale}$ from $I_t$ at $p_t$ and $s_t$.
  8: Update the translation model $A^{trans}_t$, $B^{trans}_t$ using (5).
  9: Update the scale model $A^{scale}_t$, $B^{scale}_t$ using (5).

Initialization stage:

I. Obtaining the input and output of the translation filter

% desired translation filter output (gaussian shaped), bandwidth proportional to target size
% prod(X) returns the product of the elements of X
% ndgrid replicates the two vectors into the grids rs (rows) and cs (cols)
output_sigma = sqrt(prod(base_target_sz)) * output_sigma_factor;
[rs, cs] = ndgrid((1:sz(1)) - floor(sz(1)/2), (1:sz(2)) - floor(sz(2)/2));
y = exp(-0.5 * (((rs.^2 + cs.^2) / output_sigma^2)));
yf = single(fft2(y)); % convert the result to single precision

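Given the initial target box (x0, y0, w0, h0), the author first builds the coordinate grid of the patch to be extracted, as follows:

[rs, cs] = ndgrid((1:sz(1)) - floor(sz(1)/2), (1:sz(2)) - floor(sz(2)/2));

Here sz(1) = 2h0 and sz(2) = 2w0. The resulting patch covers the following ranges:

patch rows: $\{r \mid 1-h_0 \le r \le h_0\}$, patch cols: $\{c \mid 1-w_0 \le c \le w_0\}$

To make this easier to follow, here is a test run of ndgrid: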

>> sz=[4,6]
sz =
     4     6
>> [rs, cs] = ndgrid((1:sz(1)) - floor(sz(1)/2), (1:sz(2)) - floor(sz(2)/2))
rs =
    -1    -1    -1    -1    -1    -1
     0     0     0     0     0     0
     1     1     1     1     1     1
     2     2     2     2     2     2
cs =
    -2    -1     0     1     2     3
    -2    -1     0     1     2     3
    -2    -1     0     1     2     3
    -2    -1     0     1     2     3

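As the test run shows, [rs,cs]=ndgrid(a,b) returns two m*n matrices rs and cs, where m is the length of a and n is the length of b. Pairing the entries of rs and cs at the same position enumerates every possible combination of an element of a with an element of b.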

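As discussed for MOSSE, the correlation filter is solved as (element-wise division in the frequency domain):

$$H = \frac{G}{F}$$

The patches obtained above, after the FFT, are the F in this formula, while G is a Gaussian-shaped output whose peak lies at the center of the patch. A Gaussian generally takes the following form.
One-dimensional Gaussian:
$$G(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}}$$
Two-dimensional Gaussian:
$$G(x,y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}}$$
The author's program implements it as follows (without the normalizing constant, so the peak value is 1):

y = exp(-0.5 * (((rs.^2 + cs.^2) / output_sigma^2)));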

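Here output_sigma = $\sqrt{w_0 h_0} \cdot \text{factor}$, where factor is a constant given in the program (output_sigma_factor), which the author sets to 1/16. The above gives the desired output, but what we need is the Fourier transform of y, so the author applies the FFT to y: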

yf = single(fft2(y));
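As a quick sanity check of this construction, the following sketch builds y for a made-up target size and locates its peak. The target size [40 60] is a hypothetical value chosen only for illustration; the padding to twice the target size and the factor 1/16 follow the text above.

```matlab
% Hypothetical example values, not taken from any particular sequence
base_target_sz = [40 60];                 % [height width] of the initial box
sz = 2 * base_target_sz;                  % padded patch, twice the target size
output_sigma = sqrt(prod(base_target_sz)) / 16;

% Same grid and Gaussian label as in the walkthrough
[rs, cs] = ndgrid((1:sz(1)) - floor(sz(1)/2), (1:sz(2)) - floor(sz(2)/2));
y = exp(-0.5 * ((rs.^2 + cs.^2) / output_sigma^2));

% Locate the peak of the desired output
[peak_val, idx] = max(y(:));
[peak_row, peak_col] = ind2sub(size(y), idx);
```

The peak has value 1 and sits at index sz/2 in both dimensions. This is why the position update in the tracking stage subtracts sz/2 from the location of the maximum response: a response peak at sz/2 means the target has not moved.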

II. Obtaining the input and output of the scale filter

This is done much like for the translation filter:

% desired scale filter output (gaussian shaped), bandwidth proportional to number of scales
scale_sigma = nScales/sqrt(33) * scale_sigma_factor;
ss = (1:nScales) - ceil(nScales/2);
ys = exp(-0.5 * (ss.^2) / scale_sigma^2);
ysf = single(fft(ys));

In this code, nScales is the number of scales, here nScales=33; scale_sigma_factor is also a given parameter, here scale_sigma_factor=1/4.
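To make the numbers concrete, here is a small sketch that evaluates the scale label with the parameter values quoted above (nScales=33, scale_sigma_factor=1/4). It only repeats the computation from the walkthrough and inspects the result; the variable peak_idx is introduced here for illustration.

```matlab
nScales = 33;
scale_sigma_factor = 1/4;

scale_sigma = nScales/sqrt(33) * scale_sigma_factor;   % equals sqrt(33)/4
ss = (1:nScales) - ceil(nScales/2);                    % runs from -16 to 16
ys = exp(-0.5 * (ss.^2) / scale_sigma^2);              % 1-D Gaussian label

% The label peaks at the middle scale level
[~, peak_idx] = max(ys);
```

The label is a 1-D Gaussian over the 33 scale levels, centered on level 17 (where ss is 0), so the filter is trained to respond most strongly to the middle scale level.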

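III. Obtaining the Hann window

The Hann window is used to reduce the influence of the image borders on the FFT result.
The author's code obtains it as follows:

cos_window = single(hann(sz(1)) * hann(sz(2))');

This yields a cos_window of size sz(1)*sz(2).
The hann function, hann(L):

$$w(n) = 0.5\left(1 - \cos\left(2\pi \frac{n}{N}\right)\right), \quad 0 \le n \le N, \quad N = L - 1$$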

>> hann(5)
ans =
         0
    0.5000
    1.0000
    0.5000
         0
>> hann(3)
ans =
     0
     1
     0
>> hann(5)*hann(3)'
ans =
         0         0         0
         0    0.5000         0
         0    1.0000         0
         0    0.5000         0
         0         0         0
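The window can also be computed directly from the formula, which is handy when the hann function is not available. The sketch below is an illustration only: hann_col is a hypothetical helper defined here, and feat is a made-up two-channel feature map standing in for real HOG features.

```matlab
% Hypothetical helper: Hann window of length L as a column vector,
% w(n) = 0.5*(1 - cos(2*pi*n/N)), n = 0..N, N = L-1
hann_col = @(L) 0.5 * (1 - cos(2*pi*(0:L-1)' / (L-1)));

w5 = hann_col(5);
w3 = hann_col(3);
win = w5 * w3';                         % 5x3 two-dimensional window

% Apply the same window to every channel of a made-up feature map
feat = ones(5, 3, 2);
windowed = bsxfun(@times, win, feat);
```

Every channel is scaled by the same window, so all values on the patch border are driven to zero and only the interior keeps its weight.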

Tracking stage

I. Updating the position of the target box

% extract the test sample feature map for the translation filter
% this yields the HOG feature map of the sample, multiplied by the Hann window to reduce the influence of the image borders on the FFT
xt = get_translation_sample(im, pos, sz, currentScaleFactor, cos_window);
% calculate the correlation response of the translation filter
xtf = fft2(xt);
response = real(ifft2(sum(hf_num .* xtf, 3) ./ (hf_den + lambda)));
% find the maximum translation response
% this is where the new position is found
[row, col] = find(response == max(response(:)), 1);
% update the position
pos = pos + round((-sz/2 + [row, col]) * currentScaleFactor);
positions(frame,:) = [pos target_sz];

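The author first resizes the current patch, which is twice the current target size (im_patch), to sz, twice the target size in the first frame, and then extracts the HOG feature map of im_patch.
The HOG extraction produces 32 feature maps of size sz(1)*sz(2); the feature map xt used by the filter has 28 channels, each of size sz(1)*sz(2). Each channel is multiplied element-wise by the previously generated cos_window, which reduces the influence of the image borders on the FFT. The xt in the program above is this windowed feature map.
Applying the FFT to xt gives xtf.
With xtf, the following formula (equation 6 in the paper) is used:

$$y = \mathcal{F}^{-1}\left\{\frac{\sum_{l=1}^{d} \overline{A^l} Z^l}{B + \lambda}\right\}$$

In the code, hf_num holds $\overline{A^l}$, hf_den holds $B$, and xtf is $Z$. This gives the correlation score of the candidate patch; the position of the maximum response is taken as the new position of the target box.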
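The detection step can be tried out on synthetic data. The sketch below is a toy under stated assumptions, not the author's pipeline: it uses a single feature channel (d = 1), a Gaussian blob in place of HOG features, a cyclic shift in place of real target motion, and made-up values for sz, sigma, and lambda. It trains the numerator and denominator from one patch and then applies the response formula to the shifted patch.

```matlab
sz = [32 32];                      % hypothetical patch size
sigma = 2;                         % hypothetical label bandwidth
lambda = 1e-2;                     % hypothetical regularization weight

% Desired Gaussian output and its FFT, as in the initialization stage
[rs, cs] = ndgrid((1:sz(1)) - floor(sz(1)/2), (1:sz(2)) - floor(sz(2)/2));
yf = fft2(exp(-0.5 * (rs.^2 + cs.^2) / sigma^2));

% Synthetic single-channel training patch: a smooth blob at the patch center
x = exp(-0.5 * (rs.^2 + cs.^2) / 3^2);
xf = fft2(x);
hf_num = yf .* conj(xf);           % numerator of the filter
hf_den = xf .* conj(xf);           % denominator of the filter

% Test patch: the same content moved 3 rows down and 5 columns left (cyclic)
shift = [3 -5];
zf = fft2(circshift(x, shift));

% Correlation response and location of its maximum
response = real(ifft2(hf_num .* zf ./ (hf_den + lambda)));
[row, col] = find(response == max(response(:)), 1);
est_shift = [row, col] - floor(sz/2);
```

In this toy setting est_shift equals the applied shift, which mirrors the term -sz/2 + [row, col] in the position update of the walkthrough. With real images the shift is not cyclic and the features change from frame to frame, so the response peak is only an estimate.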

II. Updating the scale of the target box

% extract the test sample feature map for the scale filter
xs = get_scale_sample(im, pos, base_target_sz, currentScaleFactor * scaleFactors, scale_window, scale_model_sz);
% calculate the correlation response of the scale filter
xsf = fft(xs,[],2);
scale_response = real(ifft(sum(sf_num .* xsf, 1) ./ (sf_den + lambda)));
% find the maximum scale response
recovered_scale = find(scale_response == max(scale_response(:)), 1);
% update the scale
currentScaleFactor = currentScaleFactor * scaleFactors(recovered_scale);
if currentScaleFactor < min_scale_factor
    currentScaleFactor = min_scale_factor;
elseif currentScaleFactor > max_scale_factor
    currentScaleFactor = max_scale_factor;
end
target_sz = floor(base_target_sz * currentScaleFactor);
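The same formula can be exercised in one dimension for the scale filter. The sketch below is a toy under stated assumptions: the feature dimension d, the synthetic feature matrix xs, the shift k, and lambda are all made up for illustration, and a cyclic shift along the scale axis stands in for a real change of target size. Each column of xs plays the role of the feature vector of one scale level.

```matlab
nScales = 33;
d = 4;                                      % hypothetical feature dimension
lambda = 1e-2;                              % hypothetical regularization weight

% Desired 1-D Gaussian output over the scale levels, as in the initialization
scale_sigma = nScales/sqrt(33) * (1/4);
ss = (1:nScales) - ceil(nScales/2);
ysf = fft(exp(-0.5 * (ss.^2) / scale_sigma^2));

% Synthetic training sample: one d-dimensional column per scale level
xs = cos((1:d)' * (1:nScales) / 3);
xsf = fft(xs, [], 2);                       % 1-D FFT along the scale dimension
sf_num = bsxfun(@times, ysf, conj(xsf));
sf_den = sum(xsf .* conj(xsf), 1);

% Test sample: the same columns moved k scale levels to the right (cyclic)
k = 2;
zsf = fft(circshift(xs, [0 k]), [], 2);
scale_response = real(ifft(sum(sf_num .* zsf, 1) ./ (sf_den + lambda)));
recovered_scale = find(scale_response == max(scale_response(:)), 1);
```

Here recovered_scale comes out as the middle level plus k, that is 19. In the tracker, the recovered index selects an entry of scaleFactors; assuming scaleFactors is built so that its middle entry equals 1, a peak at the middle level means the scale is unchanged.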

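The author splits tracking into two parts, position change and scale change, and uses two separate correlation filters to estimate each. The two filters are therefore implemented in very similar ways, but there are a few differences:
1. The translation correlation filter (TF) extracts its HOG feature map from an image patch twice the size of the target box, and there is only one such candidate patch, centered on the target box determined in the previous frame.
The scale correlation filter (SF), by contrast, takes the current target box size as the reference and extracts HOG feature maps of candidate boxes at 33 different scales, that is:

ss = (1:nScales) - ceil(nScales/2);

The theoretical basis is:

patch size $= a^n W \times a^n H$, $\quad n \in \left\{-\frac{S-1}{2}, \dots, \frac{S-1}{2}\right\}$
where W and H are the width and height of the target box, S is the number of scales, and a is the scale step between adjacent levels.
2. In the SF, the FFT (fast Fourier transform) and IFFT (inverse fast Fourier transform) are both one-dimensional, whereas the TF uses two-dimensional transforms.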
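The scale pyramid from point 1 can be written out directly. In the sketch below the scale step a = 1.02 and the target size are assumed values for illustration; the text above only fixes S = 33.

```matlab
a = 1.02;                        % assumed scale step between adjacent levels
S = 33;                          % number of scales
W = 60; H = 40;                  % hypothetical target width and height

n = -(S-1)/2 : (S-1)/2;          % exponents -16, ..., 16
patch_w = a.^n * W;              % candidate widths, smallest to largest
patch_h = a.^n * H;              % candidate heights, smallest to largest
```

With these values the candidates range from roughly 0.73 to 1.37 times the current target size, and the middle entry (n = 0) is the current size itself. Each candidate is then resized to a common size before feature extraction, which is what the scale_model_sz argument of get_scale_sample suggests.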

III. Updating the TF and SF models

% extract the training sample feature map for the translation filter
xl = get_translation_sample(im, pos, sz, currentScaleFactor, cos_window);
% calculate the translation filter update
xlf = fft2(xl);
new_hf_num = bsxfun(@times, yf, conj(xlf));
new_hf_den = sum(xlf .* conj(xlf), 3);
% extract the training sample feature map for the scale filter
xs = get_scale_sample(im, pos, base_target_sz, currentScaleFactor * scaleFactors, scale_window, scale_model_sz);
% calculate the scale filter update
xsf = fft(xs,[],2);
new_sf_num = bsxfun(@times, ysf, conj(xsf));
new_sf_den = sum(xsf .* conj(xsf), 1);
if frame == 1
    % first frame, train with a single image
    hf_den = new_hf_den;
    hf_num = new_hf_num;
    sf_den = new_sf_den;
    sf_num = new_sf_num;
else
    % subsequent frames, update the model
    hf_den = (1 - learning_rate) * hf_den + learning_rate * new_hf_den;
    hf_num = (1 - learning_rate) * hf_num + learning_rate * new_hf_num;
    sf_den = (1 - learning_rate) * sf_den + learning_rate * new_sf_den;
    sf_num = (1 - learning_rate) * sf_num + learning_rate * new_sf_num;
end
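The update for subsequent frames is an exponential moving average, so the influence of an old frame decays geometrically. The sketch below illustrates this with scalars: model stands in for any of hf_num, hf_den, sf_num, or sf_den, and the learning rate 0.025 is an assumed value used only for illustration.

```matlab
learning_rate = 0.025;          % assumed value for illustration
model = 1;                      % stand-in for the model after the first frame

% Pretend each of the next 100 frames contributes a new value of 0,
% so that what remains in the model is the weight of the first frame
for frame = 2:101
    new_value = 0;
    model = (1 - learning_rate) * model + learning_rate * new_value;
end
```

After 100 updates the first frame still carries a weight of (1 - learning_rate)^100, about 0.08 with this learning rate. A larger learning rate adapts faster to appearance changes but also forgets the original target sooner.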