Analysis and Performance Optimization of the Internal Ceres Code Called by VINS-Mono

    • 1. The back-end parameter optimization call flow in VINS-Mono
    • 2. The solve flow inside Ceres (to be continued)

First, my apologies that some of the image-algorithm code in my earlier posts was incomplete; I was not yet familiar with how CSDN works and left a few things out. If you are interested, you can add me on WeChat (yhtao923) and we can discuss.
This article does not cover the algorithmic principles of VINS-Mono; there is plenty of material on that online. Instead, it analyzes the key Ceres code paths that VINS-Mono invokes, restructures the matrices involved, and uses platform vector instructions to obtain a large speedup. Note that only the runtime performance of Ceres's nonlinear optimization is improved; the algorithm's numerical results do not change. Resources on this topic are scarce, and most stop short of optimizing the internals of Ceres. I read through the library's internal code and reworked it around the structure of the SLAM data; the same ideas apply to other kinds of nonlinear optimization problems.

1. The back-end parameter optimization call flow in VINS-Mono

The back-end optimization code of VINS-Mono lives in VINS-Mono\vins_estimator\src, in the file estimator.cpp, function optimization():

    /* Add the inertial (IMU) parameter blocks */
    vector2double();
	
    //loss_function = new ceres::HuberLoss(1.0);
    loss_function = new ceres::CauchyLoss(1.0);
    for (int i = 0; i < WINDOW_SIZE + 1; i++)
    {
        ceres::LocalParameterization *local_parameterization = new PoseLocalParameterization();
        problem.AddParameterBlock(para_Pose[i], SIZE_POSE, local_parameterization);
        problem.AddParameterBlock(para_SpeedBias[i], SIZE_SPEEDBIAS);
    }
    for (int i = 0; i < NUM_OF_CAM; i++)
    {
        ceres::LocalParameterization *local_parameterization = new PoseLocalParameterization();
        problem.AddParameterBlock(para_Ex_Pose[i], SIZE_POSE, local_parameterization);
        if (!ESTIMATE_EXTRINSIC)
        {
            ROS_DEBUG("fix extinsic param");
            problem.SetParameterBlockConstant(para_Ex_Pose[i]);
        }
        else
        {
            ROS_DEBUG("estimate extinsic param");
        }
    }

VINS-Mono defaults to one camera and a sliding window of 10 frames: the macros NUM_OF_CAM and WINDOW_SIZE are 1 and 10 respectively.
The parameters therefore cover 11 frames (WINDOW_SIZE + 1) of position, orientation, velocity, and accelerometer/gyroscope biases, i.e. (3 + 4 + 3 + 3 + 3) * 11 values, held in the para_Pose and para_SpeedBias blocks, plus one camera extrinsic block para_Ex_Pose, for a total of 2 * 11 + 1 = 23 parameter blocks.
Next, add the inertial (IMU) residual blocks to the Ceres problem.

    /* Add the IMU residual blocks */
    for (int i = 0; i < WINDOW_SIZE; i++)
    {
        int j = i + 1;
        if (pre_integrations[j]->sum_dt > 10.0)
            continue;
		
        IMUFactor* imu_factor = new IMUFactor(pre_integrations[j]);
        problem.AddResidualBlock(imu_factor, NULL, para_Pose[i], para_SpeedBias[i], para_Pose[j], para_SpeedBias[j]);
    }
    /* Add the visual (reprojection) residual blocks */
    int f_m_cnt = 0;
    int feature_index = -1;
    for (auto &it_per_id : f_manager.feature)
    {
        it_per_id.used_num = it_per_id.feature_per_frame.size();
        if (!(it_per_id.used_num >= 2 && it_per_id.start_frame < WINDOW_SIZE - 2))
            continue;
        ++feature_index;
        int imu_i = it_per_id.start_frame, imu_j = imu_i - 1;
        Vector3d pts_i = it_per_id.feature_per_frame[0].point;
        for (auto &it_per_frame : it_per_id.feature_per_frame)
        {
            imu_j++;
            if (imu_i == imu_j)
            {
                continue;
            }
            Vector3d pts_j = it_per_frame.point;
            if (ESTIMATE_TD)
            {
                    ProjectionTdFactor *f_td = new ProjectionTdFactor(pts_i, pts_j, it_per_id.feature_per_frame[0].velocity, it_per_frame.velocity,
                                                                     it_per_id.feature_per_frame[0].cur_td, it_per_frame.cur_td,
                                                                     it_per_id.feature_per_frame[0].uv.y(), it_per_frame.uv.y());
                    problem.AddResidualBlock(f_td, loss_function, para_Pose[imu_i], para_Pose[imu_j], para_Ex_Pose[0], para_Feature[feature_index], para_Td[0]);
            }
            else
            {
                ProjectionFactor *f = new ProjectionFactor(pts_i, pts_j);
                problem.AddResidualBlock(f, loss_function, para_Pose[imu_i], para_Pose[imu_j], para_Ex_Pose[0], para_Feature[feature_index]);
            }
            f_m_cnt++;
        }
    }

The next step is to build and run the Ceres solver:

    ceres::Solver::Options options;
    options.linear_solver_type = ceres::DENSE_SCHUR;
    options.trust_region_strategy_type = ceres::DOGLEG;
    options.max_num_iterations = NUM_ITERATIONS;
    if (marginalization_flag == MARGIN_OLD)
        options.max_solver_time_in_seconds = SOLVER_TIME * 4.0 / 5.0;
    else
        options.max_solver_time_in_seconds = SOLVER_TIME;
    TicToc t_solver;
    ceres::Solver::Summary summary;
    ceres::Solve(options, &problem, &summary);

From the code above, the parameters of the nonlinear problem fall into two groups: a fixed inertial/extrinsic group (the 23 blocks above, 11 frames' worth), and the visual para_Feature blocks, whose count varies, typically from a few hundred up to a thousand or more.
The key call here, ceres::Solve(options, &problem, &summary), dominates the runtime, so the next section walks through the solve flow inside Ceres.

2. The solve flow inside Ceres (to be continued)

The main entry point, ceres::Solve(options, &problem, &summary), is in the file ceres-solver\internal\ceres\solver.cc.

Internally it breaks into three key stages: preprocessing via Preprocess(), minimization via Minimize(), and post-processing via PostSolveSummarize().
(A mind map of this call structure appeared here in the original post.)

Let us analyze these three big functions one by one.
Key code in the preprocessing function Preprocess():
since VINS-Mono selects the trust-region algorithm, the work is done mainly by the entry function in trust_region_preprocessor.cc:

bool TrustRegionPreprocessor::Preprocess(const Solver::Options& options,
                                         ProblemImpl* problem,
                                         PreprocessedProblem* pp) 

This preprocessing function sets up four pieces: SetupLinearSolver(), SetupEvaluator(), SetupInnerIterationMinimizer(), and SetupMinimizerOptions().
Apart from SetupLinearSolver(), these merely initialize some parameters of the solve and have little impact on performance, so we focus on SetupLinearSolver(). Note that it internally calls ReorderProgram(), whose comment reads:

// Reorder the program to reduce fill in and improve cache coherency
// of the Jacobian.

That is, the program is reordered to reduce fill-in and improve cache coherency of the Jacobian; in practice the ordering matters even more than that. By reordering the Jacobian to match the structure of our problem, the subsequent Schur elimination becomes much more convenient. The stock ReorderProgram() shuffles the Jacobian's parameter ordering; my guess (unconfirmed) is that this is done for multi-threading convenience.
ReorderProgram() is in trust_region_preprocessor.cc. Of the three functions in its branches, the options we set earlier route execution to ReorderProgramForSchurTypeLinearSolver(), whose source is in reorder_program.cc; the ordering function it uses, ComputeStableSchurOrdering() in parameter_block_ordering.cc, sorts the parameter blocks via a graph formulation similar to g2o's. We will not dwell on that ordering algorithm here; for performance, I replaced it with a simple fixed ordering.
Given the SLAM parameter structure described earlier, there are 23 fixed parameter blocks, while each visual parameter block has 2-dimensional residuals and contributes a single element to the Hessian, so the visual part forms a diagonal block in the Hessian that the Schur elimination treats as one big matrix block. We therefore put the 23 inertial and camera-extrinsic blocks in one group and the visual parameters in the other. The reordering code becomes:

  vector<ParameterBlock*> schur_ordering;
#if 0
  // const int size_of_first_elimination_group =
  //     ComputeStableSchurOrdering(*program, &schur_ordering);
#else
  int size_of_first_elimination_group = 0;
  const vector<ParameterBlock*>& parameter_blocks = program->parameter_blocks();
  int param_blocks_size = parameter_blocks.size();

  // The visual (feature) blocks come first: they form the elimination group.
  for (int i = 23; i < param_blocks_size; ++i) {
    ParameterBlock* parameter_block = parameter_blocks[i];
    schur_ordering.push_back(parameter_block);
    size_of_first_elimination_group++;
  }
  // The 23 fixed inertial/extrinsic blocks follow.
  for (int i = 0; i < 23; ++i) {
    ParameterBlock* parameter_block = parameter_blocks[i];
    schur_ordering.push_back(parameter_block);
  }
#endif

The code above goes into ReorderProgramForSchurTypeLinearSolver() in reorder_program.cc, replacing the call to ComputeStableSchurOrdering(). Note: this reordering of the parameters is the key change of the whole optimization.
For the later Schur elimination and linear solve, the important structure is the description of the sparse Jacobian. It is created in SetupMinimizerOptions() by the following call:

pp->minimizer_options.jacobian.reset(pp->evaluator->CreateJacobian());

CreateJacobian() is in block_jacobian_writer.cc. The sparse block Jacobian is described by the class CompressedRowBlockStructure, which consists of a column structure and a row structure: the column structure has one entry per parameter block, the row structure one entry per residual block. The code:

  bs->cols.resize(parameter_blocks.size());
  int cursor = 0;
  for (int i = 0; i < parameter_blocks.size(); ++i) {
    CHECK_NE(parameter_blocks[i]->index(), -1);
    CHECK(!parameter_blocks[i]->IsConstant());
    bs->cols[i].size = parameter_blocks[i]->LocalSize();
    bs->cols[i].position = cursor;
    cursor += bs->cols[i].size;
  }

Here bs->cols[i].size is the number of parameters in block i, i.e. the number of columns of that small Jacobian block, and bs->cols[i].position is the starting column of block i.
The row structure is then initialized from the column structure, as in the code below.
Analysis is added in the comments.

// Construct the cells in each row.
  const vector<ResidualBlock*>& residual_blocks = program_->residual_blocks();
  int row_block_position = 0;
  /* one row entry per residual block */
  bs->rows.resize(residual_blocks.size());
  
  for (int i = 0; i < residual_blocks.size(); ++i) {
    const ResidualBlock* residual_block = residual_blocks[i];
    CompressedRow* row = &bs->rows[i];
    /* the size of each row block is the residual dimension */
    row->block.size = residual_block->NumResiduals();
    /* the starting position of each row block */
    row->block.position = row_block_position;
    row_block_position += row->block.size;
    
    // Size the row by the number of active parameters in this residual.
    /* some parameter blocks are held constant and are not optimized */
    const int num_parameter_blocks = residual_block->NumParameterBlocks();
    int num_active_parameter_blocks = 0;
    for (int j = 0; j < num_parameter_blocks; ++j) {
      if (residual_block->parameter_blocks()[j]->index() != -1) {
        num_active_parameter_blocks++;
      }
    }
    /* number of cells in this row block; each cell is the Jacobian sub-block for one parameter block */
    row->cells.resize(num_active_parameter_blocks);
    
    /* fill in each cell's information */
    // Add layout information for the active parameters in this row.
    for (int j = 0, k = 0; j < num_parameter_blocks; ++j) {
      const ParameterBlock* parameter_block =
          residual_block->parameter_blocks()[j];
      if (!parameter_block->IsConstant()) {
        Cell& cell = row->cells[k];
        /* index of the parameter block this cell belongs to */
        cell.block_id = parameter_block->index();
        /* position is the starting offset of this sub-block in the Jacobian value array; it is used heavily during Schur elimination */
        cell.position = jacobian_layout_[i][k];
        // Only increment k for active parameters, since there is only layout
        // information for active parameters.
        k++;
      }
    }

    sort(row->cells.begin(), row->cells.end(), CellLessThan);
  }

Analysis of the solve function Minimize()
The body of Minimize() is in trust_region_minimizer.cc; as noted above, the project configures the trust-region solver type.

void TrustRegionMinimizer::Minimize(const Minimizer::Options& options,
                                    double* parameters,
                                    Solver::Summary* solver_summary) {
  start_time_in_secs_ = WallTimeInSeconds();
  iteration_start_time_in_secs_ = start_time_in_secs_;
  /* initialize the minimizer state */
  Init(options, parameters, solver_summary);
  /* first evaluation of the Jacobian and residuals */
  RETURN_IF_ERROR_AND_LOG(IterationZero());

  // Create the TrustRegionStepEvaluator. The construction needs to be
  // delayed to this point because we need the cost for the starting
  // point to initialize the step evaluator.
  step_evaluator_.reset(new TrustRegionStepEvaluator(
      x_cost_,
      options_.use_nonmonotonic_steps
          ? options_.max_consecutive_nonmonotonic_steps
          : 0));
  iteration_start_time_in_secs_ = WallTimeInSeconds();
  while (FinalizeIterationAndCheckIfMinimizerCanContinue()) {
    

    const double previous_gradient_norm = iteration_summary_.gradient_norm;
    const double previous_gradient_max_norm =
        iteration_summary_.gradient_max_norm;
 
    iteration_summary_ = IterationSummary();
    iteration_summary_.iteration =
        solver_summary->iterations.back().iteration + 1;
    /* ComputeTrustRegionStep performs the Schur elimination, solves the reduced system, and computes the delta update. It is the most important and most expensive part of each iteration, and the focus of our optimization. */
    RETURN_IF_ERROR_AND_LOG(ComputeTrustRegionStep());

    if (!iteration_summary_.step_is_valid) {
      RETURN_IF_ERROR_AND_LOG(HandleInvalidStep());
      continue;
    }

    if (options_.is_constrained &&
        options_.max_num_line_search_step_size_iterations > 0) {
      // Use a projected line search to enforce the bounds constraints
      // and improve the quality of the step.
      DoLineSearch(x_, gradient_, x_cost_, &delta_);
    }

    ComputeCandidatePointAndEvaluateCost();
    DoInnerIterationsIfNeeded();

    if (ParameterToleranceReached()) {
      return;
    }

    if (FunctionToleranceReached()) {
      return;
    }

    if (IsStepSuccessful()) {
      RETURN_IF_ERROR_AND_LOG(HandleSuccessfulStep());
    } else {
      // Declare the step unsuccessful and inform the trust region strategy.
      iteration_summary_.step_is_successful = false;
      iteration_summary_.cost = candidate_cost_ + solver_summary_->fixed_cost;

      // When the step is unsuccessful, we do not compute the gradient
      // (or update x), so we preserve its value from the last
      // successful iteration.
      iteration_summary_.gradient_norm = previous_gradient_norm;
      iteration_summary_.gradient_max_norm = previous_gradient_max_norm;
      strategy_->StepRejected(iteration_summary_.relative_decrease);
    }
	
  }
}

The structure of the sub-functions called under Minimize() is as follows (a call diagram appeared here in the original post):

We first analyze IterationZero(). Its main call is EvaluateGradientAndJacobian(), which in turn calls:

    evaluator_->Evaluate(evaluate_options,
                         x_.data(),
                         &x_cost_,
                         residuals_.data(),
                         gradient_.data(),
                         jacobian_);

Evaluate() is implemented in program_evaluator.h. It first calls PrepareForEvaluation() to lay out the small Jacobian and residual blocks in memory.
The user-supplied cost, residual, and Jacobian evaluation is then invoked as follows:

// Evaluate the cost, residuals, and jacobians.
          double block_cost;
          if (!residual_block->Evaluate(
                  evaluate_options.apply_loss_function,
                  &block_cost,
                  block_residuals,
                  block_jacobians,
                  scratch->residual_block_evaluate_scratch.get())) {
            abort = true;
            return;
          }

This evaluation accounts for a small share of the total solve time, so we can set it aside for now. The most expensive part is the Schur elimination in Eliminate() in schur_eliminator_impl.h; SolveReducedLinearSystem() and BackSubstitute() are also fairly costly. Here we mainly optimize Eliminate(). The before/after timing comparison:

I0814 01:09:30.134727 127160 schur_complement_solver.cc:171] EliminateSlam 0.000228167
I0814 01:09:30.140404 127160 schur_complement_solver.cc:178] Eliminate 0.00563121
I0814 01:09:30.144488 127160 schur_complement_solver.cc:171] EliminateSlam 0.00022912
I0814 01:09:30.151427 127160 schur_complement_solver.cc:178] Eliminate 0.00400996
I0814 01:09:30.156728 127160 schur_complement_solver.cc:171] EliminateSlam 0.000275135
I0814 01:09:30.157441 127160 schur_complement_solver.cc:178] Eliminate 0.000660896
I0814 01:09:30.158865 127160 schur_complement_solver.cc:171] EliminateSlam 0.000221014
I0814 01:09:30.169198 127160 schur_complement_solver.cc:178] Eliminate 0.00068593
I0814 01:09:30.173861 127160 trust_region_minimizer.cc:136] cost 0.0395019

I0814 01:09:30.242223 127160 schur_complement_solver.cc:171] EliminateSlam 0.000225067
I0814 01:09:30.242919 127160 schur_complement_solver.cc:178] Eliminate 0.000641108
I0814 01:09:30.247565 127160 schur_complement_solver.cc:171] EliminateSlam 0.00022912
I0814 01:09:30.253784 127160 schur_complement_solver.cc:178] Eliminate 0.000674009
I0814 01:09:30.256924 127160 schur_complement_solver.cc:171] EliminateSlam 0.000255108
I0814 01:09:30.265875 127160 schur_complement_solver.cc:178] Eliminate 0.000684977
I0814 01:09:30.272472 127160 schur_complement_solver.cc:171] EliminateSlam 0.000262022
I0814 01:09:30.280575 127160 schur_complement_solver.cc:178] Eliminate 0.00079298
I0814 01:09:30.284862 127160 trust_region_minimizer.cc:136] cost 0.0431149

I0814 01:09:30.341929 127160 schur_complement_solver.cc:171] EliminateSlam 0.000477076
I0814 01:09:30.344331 127160 schur_complement_solver.cc:178] Eliminate 0.000635147
I0814 01:09:30.357460 127160 schur_complement_solver.cc:171] EliminateSlam 0.000243902
I0814 01:09:30.358187 127160 schur_complement_solver.cc:178] Eliminate 0.000662088
I0814 01:09:30.366591 127160 schur_complement_solver.cc:171] EliminateSlam 0.0018239
I0814 01:09:30.369652 127160 schur_complement_solver.cc:178] Eliminate 0.00064683
I0814 01:09:30.372679 127160 trust_region_minimizer.cc:136] cost 0.0314691
I0814 01:09:30.436864 127160 schur_complement_solver.cc:171] EliminateSlam 0.000241041
I0814 01:09:30.438820 127160 schur_complement_solver.cc:178] Eliminate 0.00144482
I0814 01:09:30.441287 127160 schur_complement_solver.cc:171] EliminateSlam 0.000237942
I0814 01:09:30.446676 127160 schur_complement_solver.cc:178] Eliminate 0.00332403
I0814 01:09:30.456245 127160 schur_complement_solver.cc:171] EliminateSlam 0.00421715
I0814 01:09:30.456990 127160 schur_complement_solver.cc:178] Eliminate 0.00067687
I0814 01:09:30.460294 127160 schur_complement_solver.cc:171] EliminateSlam 0.000252008
I0814 01:09:30.468626 127160 schur_complement_solver.cc:178] Eliminate 0.000689983
I0814 01:09:30.472404 127160 trust_region_minimizer.cc:136] cost 0.0359242

I0814 01:09:30.530846 127160 schur_complement_solver.cc:171] EliminateSlam 0.000265121
I0814 01:09:30.532482 127160 schur_complement_solver.cc:178] Eliminate 0.000962019
I0814 01:09:30.542966 127160 schur_complement_solver.cc:171] EliminateSlam 0.000476122
I0814 01:09:30.547399 127160 schur_complement_solver.cc:178] Eliminate 0.000714064
I0814 01:09:30.553222 127160 schur_complement_solver.cc:171] EliminateSlam 0.000302076
I0814 01:09:30.563824 127160 schur_complement_solver.cc:178] Eliminate 0.000738144
I0814 01:09:30.566916 127160 trust_region_minimizer.cc:136] cost 0.0364859

I0814 01:09:30.624837 127160 schur_complement_solver.cc:171] EliminateSlam 0.000313997
I0814 01:09:30.628013 127160 schur_complement_solver.cc:178] Eliminate 0.00311995
I0814 01:09:30.636418 127160 schur_complement_solver.cc:171] EliminateSlam 0.000257015
I0814 01:09:30.639873 127160 schur_complement_solver.cc:178] Eliminate 0.00140882
I0814 01:09:30.641511 127160 schur_complement_solver.cc:171] EliminateSlam 0.000247955
I0814 01:09:30.654429 127160 schur_complement_solver.cc:178] Eliminate 0.00766611
I0814 01:09:30.659863 127160 trust_region_minimizer.cc:136] cost 0.0354891

I0814 01:09:30.717357 127160 schur_complement_solver.cc:171] EliminateSlam 0.00029397
I0814 01:09:30.720774 127160 schur_complement_solver.cc:178] Eliminate 0.00069499
I0814 01:09:30.726763 127160 schur_complement_solver.cc:171] EliminateSlam 0.000328064
I0814 01:09:30.730413 127160 schur_complement_solver.cc:178] Eliminate 0.00172615
I0814 01:09:30.737879 127160 schur_complement_solver.cc:171] EliminateSlam 0.000255108
I0814 01:09:30.740783 127160 schur_complement_solver.cc:178] Eliminate 0.000646114

Here EliminateSlam is the version optimized for the SLAM problem and Eliminate is the stock Ceres code; times are in seconds, and the optimized version is consistently several times faster. The test platform is x64, an Intel Core i7-9700 CPU @ 3.00 GHz, running in a single-core Ubuntu VM with 4 GB of RAM.
First, borrowing Ceres's own comments (schur_eliminator.h), here is the basic idea of Schur elimination.
Classes implementing the SchurEliminatorBase interface perform variable elimination for linear least squares problems. Assume the input linear system

    A x = b

can be partitioned into

    E y + F z = b

where x = [y, z] is a partition of the variables. The partition is chosen so that E'E is a block diagonal matrix; in other words, the parameter blocks in E form an independent set in the graph implied by the block matrix A'A. The eliminator then provides the functionality to compute the Schur complement system

    lhs = F'F - F'E (E'E)^{-1} E'F
    rhs = F'b - F'E (E'E)^{-1} E'b

This is the elimination operation: it constructs the reduced linear system lhs z = rhs by eliminating E.
The algorithm also provides back-substitution: once z is known, y can be recovered by solving the block diagonal system

    E'E y = E'(b - F z)

Observe that the rows of A must be reordered so that the eliminated columns form E (with block diagonal E'E) and the remaining columns form F.

To simplify the exposition, only the case of a zero damping matrix D is described. The usual elimination goes as follows. From

    E y + F z = b

we can form the normal equations:

    E'E y + E'F z = E'b
    F'E y + F'F z = F'b

Multiplying both sides of the first equation by (E'E)^{-1} and then by F'E gives:

    F'E y + F'E (E'E)^{-1} E'F z = F'E (E'E)^{-1} E'b

Subtracting this from the second equation, we get:

    (F'F - F'E (E'E)^{-1} E'F) z = F'b - F'E (E'E)^{-1} E'b

Instead of forming the normal equations for the whole sparse matrix and operating on them, the algorithm here processes one parameter block of y at a time. The set of rows corresponding to a single parameter block y_i is called a chunk, and the algorithm runs one chunk at a time: the reduced linear system is the sum, over the chunks, of each chunk's contribution. This follows from two observations:

1. E'E is block diagonal, so each of its diagonal blocks involves only one chunk's rows.
2. When computing E'F, only terms within a single chunk are non-zero, i.e. transposing and multiplying the column block of y_i only touches the rows of that chunk.
To keep the results identical before and after the optimization, we write a new function alongside Eliminate() in schur_eliminator_impl.h:

template <int kRowBlockSize, int kEBlockSize, int kFBlockSize>
void SchurEliminator<kRowBlockSize, kEBlockSize, kFBlockSize>::EliminateSlam(
	const BlockSparseMatrixData& A,
	const double* b,
	const double* D,
	BlockRandomAccessMatrix* lhs,
	double* rhs)

Note that the corresponding declarations must also be added in the headers, e.g. schur_eliminator.h.
Before Eliminate() can be called, an initialization step is required: the Init() function in schur_eliminator_impl.h, invoked from SolveImpl() in schur_complement_solver.cc as follows:

eliminator_->Init(num_eliminate_blocks, kFullRankETE, bs);

This function initializes two structures, lhs_row_layout_ and chunks_. lhs_row_layout_ stores the offset of each Hessian block (assembled from the Jacobian blocks) within the reduced (Schur complement) matrix. chunks_ is a vector whose elements each contain three members:

    int size;
    int start;
    BufferLayoutType buffer_layout;

Here size is the number of residual row blocks belonging to the parameter block, start is the index of its first residual row block, and buffer_layout describes the layout of the small blocks in the temporary buffer that holds the E'F matrix.

Now for the main target of the optimization, the Eliminate() function.
The first part adds the damping term lambda to the diagonal of the reduced matrix, which the LM algorithm uses:

 // Add the diagonal to the schur complement.
  if (D != NULL) {
    ParallelFor(context_,
                num_eliminate_blocks_,
                num_col_blocks,
                num_threads_,
                [&](int i) {
                  const int block_id = i - num_eliminate_blocks_;
                  int r, c, row_stride, col_stride;
                  CellInfo* cell_info = lhs->GetCell(
                      block_id, block_id, &r, &c, &row_stride, &col_stride);
                  if (cell_info != NULL) {
                    const int block_size = bs->cols[i].size;
                    typename EigenTypes<Eigen::Dynamic>::ConstVectorRef diag(
                        D + bs->cols[i].position, block_size);

                    std::lock_guard<std::mutex> l(cell_info->m);
                    MatrixRef m(cell_info->values, row_stride, col_stride);
                    m.block(r, c, block_size, block_size).diagonal() +=
                        diag.array().square().matrix();
                  }
                });
  }

This part is cheap, so we skip it for now.
To avoid function-call overhead inside the loops, we precompute the block layout once:

    int num_rows_ = 0;
    const int num_blocks = blocks.size();
    block_layout_.resize(num_blocks, 0);
    // Only the 23 fixed (inertial + extrinsic) blocks appear in the reduced matrix.
    for (int i = 0; i < 23; ++i)
    {
        block_layout_[i] = num_rows_;
        num_rows_ += blocks[i];
    }

which replaces the repeated calls to lhs->GetCell().
The key optimization is inside the main loop, shown below:

ParallelFor(
      context_,
      0,
      int(chunks_.size()),
      num_threads_,
      [&](int thread_id, int i) {
        double* buffer = buffer_.get() + thread_id * buffer_size_;
        const Chunk& chunk = chunks_[i];
        const int e_block_id = bs->rows[chunk.start].cells.front().block_id;
        const int e_block_size = bs->cols[e_block_id].size;

        VectorRef(buffer, buffer_size_).setZero();

        typename EigenTypes<kEBlockSize, kEBlockSize>::Matrix ete(e_block_size,
                                                                  e_block_size);

        if (D != NULL) {
          const typename EigenTypes<kEBlockSize>::ConstVectorRef diag(
              D + bs->cols[e_block_id].position, e_block_size);
          ete = diag.array().square().matrix().asDiagonal();
        } else {
          ete.setZero();
        }

        FixedArray<double, 8> g(e_block_size);
        typename EigenTypes<kEBlockSize>::VectorRef gref(g.data(),
                                                         e_block_size);
        gref.setZero();

        // We are going to be computing
        //
        //   S += F'F - F'E(E'E)^{-1}E'F
        //
        // for each Chunk. The computation is broken down into a number of
        // function calls as below.

        // Compute the outer product of the e_blocks with themselves (ete
        // = E'E). Compute the product of the e_blocks with the
        // corresponding f_blocks (buffer = E'F), the gradient of the terms
        // in this chunk (g) and add the outer product of the f_blocks to
        // Schur complement (S += F'F).
        ChunkDiagonalBlockAndGradient(
            chunk, A, b, chunk.start, &ete, g.data(), buffer, lhs);

        // Normally one wouldn't compute the inverse explicitly, but
        // e_block_size will typically be a small number like 3, in
        // which case its much faster to compute the inverse once and
        // use it to multiply other matrices/vectors instead of doing a
        // Solve call over and over again.
        typename EigenTypes<kEBlockSize, kEBlockSize>::Matrix inverse_ete =
            InvertPSDMatrix<kEBlockSize>(assume_full_rank_ete_, ete);

        // For the current chunk compute and update the rhs of the reduced
        // linear system.
        //
        //   rhs = F'b - F'E(E'E)^(-1) E'b

        if (rhs) {
          FixedArray<double, 8> inverse_ete_g(e_block_size);
          MatrixVectorMultiply<kEBlockSize, kEBlockSize, 0>(
              inverse_ete.data(),
              e_block_size,
              e_block_size,
              g.data(),
              inverse_ete_g.data());
          UpdateRhs(chunk, A, b, chunk.start, inverse_ete_g.data(), rhs);
        }

        // S -= F'E(E'E)^{-1}E'F
        ChunkOuterProduct(
            thread_id, bs, inverse_ete, buffer, chunk.buffer_layout, lhs);
      });

  // For rows with no e_blocks, the schur complement update reduces to
  // S += F'F.
  NoEBlockRowsUpdate(A, b, uneliminated_row_begins_, lhs, rhs);
}

The loop runs chunks_.size() times, i.e. once per eliminated parameter block. Each iteration takes all Jacobian blocks of that parameter block and computes the corresponding parts of lhs and rhs. The loop body calls three main functions: ChunkDiagonalBlockAndGradient(), UpdateRhs(), and ChunkOuterProduct().
ChunkDiagonalBlockAndGradient() does two things: it adds the visual rows' contribution to F'F, and it computes the E'F matrix into the temporary buffer. The code:


// Given a Chunk - set of rows with the same e_block, e.g. in the
// following Chunk with two rows.
//
//                E                   F
//      [ y11   0   0   0 |  z11     0    0   0    z51]
//      [ y12   0   0   0 |  z12   z22    0   0      0]
//
// this function computes two matrices. The diagonal block matrix
//
//   ete = y11 * y11' + y12 * y12'
//
// and the off diagonal blocks in the Gauss Newton Hessian.
//
//   buffer = [y11'(z11 + z12), y12' * z22, y11' * z51]
//
// which are zero compressed versions of the block sparse matrices E'E
// and E'F.
//
// and the gradient of the e_block, E'b.
template <int kRowBlockSize, int kEBlockSize, int kFBlockSize>
void SchurEliminator<kRowBlockSize, kEBlockSize, kFBlockSize>::
    ChunkDiagonalBlockAndGradient(
        const Chunk& chunk,
        const BlockSparseMatrixData& A,
        const double* b,
        int row_block_counter,
        typename EigenTypes<kEBlockSize, kEBlockSize>::Matrix* ete,
        double* g,
        double* buffer,
        BlockRandomAccessMatrix* lhs) {
  const CompressedRowBlockStructure* bs = A.block_structure();
  const double* values = A.values();

  int b_pos = bs->rows[row_block_counter].block.position;
  const int e_block_size = ete->rows();

  // Iterate over the rows in this chunk, for each row, compute the
  // contribution of its F blocks to the Schur complement, the
  // contribution of its E block to the matrix EE' (ete), and the
  // corresponding block in the gradient vector.
  for (int j = 0; j < chunk.size; ++j) {
    const CompressedRow& row = bs->rows[row_block_counter + j];

    if (row.cells.size() > 1) {
      EBlockRowOuterProduct(A, row_block_counter + j, lhs);
    }

    // Extract the e_block, ETE += E_i' E_i
    const Cell& e_cell = row.cells.front();
    // clang-format off
    MatrixTransposeMatrixMultiply
        <kRowBlockSize, kEBlockSize, kRowBlockSize, kEBlockSize, 1>(
            values + e_cell.position, row.block.size, e_block_size,
            values + e_cell.position, row.block.size, e_block_size,
            ete->data(), 0, 0, e_block_size, e_block_size);
    // clang-format on

    if (b) {
      // g += E_i' b_i
      // clang-format off
      MatrixTransposeVectorMultiply<kRowBlockSize, kEBlockSize, 1>(
          values + e_cell.position, row.block.size, e_block_size,
          b + b_pos,
          g);
      // clang-format on
    }

    // buffer = E'F. This computation is done by iterating over the
    // f_blocks for each row in the chunk.
    for (int c = 1; c < row.cells.size(); ++c) {
      const int f_block_id = row.cells[c].block_id;
      const int f_block_size = bs->cols[f_block_id].size;
      double* buffer_ptr = buffer + FindOrDie(chunk.buffer_layout, f_block_id);
      // clang-format off
      MatrixTransposeMatrixMultiply
          <kRowBlockSize, kEBlockSize, kRowBlockSize, kFBlockSize, 1>(
          values + e_cell.position, row.block.size, e_block_size,
          values + row.cells[c].position, row.block.size, f_block_size,
          buffer_ptr, 0, 0, e_block_size, f_block_size);
      // clang-format on
    }
    b_pos += row.block.size;
  }
}

To make this part easier to optimize, and easier to port to embedded platforms later, we split the work up. Here is the first part, with the internal multiply macros expanded and rewritten using AVX intrinsics:
for(int i = 0;i < chunks_.size();i++)
{
const Chunk& chunk = chunks_[i];
int r, c, row_stride, col_stride;
CellInfo* cell_info =
lhs->GetCell(0, 0, &r, &c, &row_stride, &col_stride);
double *lhs_val = cell_info->values;
for (int j = 0; j < chunk.size; ++j)
{
const CompressedRow& row = bs->rows[chunk.start + j];

	    //EBlockRowOuterProduct(A, chunk.start + j, lhs);


	    int block1 = row.cells[1].block_id - num_eliminate_blocks_;
	    int block1_size = bs->cols[row.cells[1].block_id].size;

		r = block_layout_[block1];
		c = block_layout_[block1];

		double *val_ptr = (double *)(values + row.cells[1].position);
		__m256i mask = _mm256_setr_epi32(-20, -20, -48, -9,20, 20, 48, 9);	
		__m256d val_b01 = _mm256_set_pd(0,0,0,0);
		__m256d val_b11 = _mm256_set_pd(0,0,0,0);
		for(int k = 0;k < block1_size;k++)
		{
			double *lhs_val_ptr = &lhs_val[(k + r) * col_stride + 0 + c];
	        double a_val0 = val_ptr[0 * block1_size  + k];
			double a_val1 = val_ptr[1 * block1_size  + k];
			__m256d v_sum0 = _mm256_loadu_pd(&lhs_val[(k + r) * col_stride + c]);
			__m256d v_sum1 = _mm256_maskload_pd(&lhs_val[(k + r) * col_stride + c + 4],mask);
			__m256d v_val0 = _mm256_set_pd(a_val0,a_val0,a_val0,a_val0);
			__m256d v_val1 = _mm256_set_pd(a_val1,a_val1,a_val1,a_val1);
            __m256d val_b00 = _mm256_loadu_pd(&val_ptr[0]);
			__m256d val_b10 = _mm256_loadu_pd(&val_ptr[block1_size]);
			
			val_b01 = _mm256_maskload_pd(&val_ptr[4],mask);
			val_b11 = _mm256_maskload_pd(&val_ptr[block1_size + 4],mask);
			
			v_sum0 = _mm256_fmadd_pd(v_val0,val_b00,v_sum0);
			v_sum0 = _mm256_fmadd_pd(v_val1,val_b10,v_sum0);
			
			v_sum1 = _mm256_fmadd_pd(v_val0,val_b01,v_sum1);
			v_sum1 = _mm256_fmadd_pd(v_val1,val_b11,v_sum1);
			double *v_sum0_ptr = (double *)&v_sum0;
			double *v_sum1_ptr = (double *)&v_sum1;
			_mm256_storeu_pd(lhs_val_ptr,v_sum0);
			lhs_val_ptr[4] = v_sum1_ptr[0];
			lhs_val_ptr[5] = v_sum1_ptr[1];
		    
		}

		   
	      const int block2 = row.cells[2].block_id - num_eliminate_blocks_;
	      const int block2_size = bs->cols[row.cells[2].block_id].size;
	     // CellInfo* cell_info =
	          //lhs->GetCell(block1, block2, &r, &c, &row_stride, &col_stride);

		  r = block_layout_[block1];
		  c = block_layout_[block2];

	
	      double *val_ptr0 = (double *)(values + row.cells[1].position);
		  double *val_ptr1 = (double *)(values + row.cells[2].position);
		
	      for(int k = 0;k < block1_size;k++)
	      {

			double *lhs_val_ptr = &lhs_val[(k + r) * col_stride + 0 + c];
	        double a_val0 = val_ptr0[0 * block1_size  + k];
			double a_val1 = val_ptr0[1 * block1_size  + k];
			__m256d v_sum0 = _mm256_loadu_pd(&lhs_val[(k + r) * col_stride + c]);
			__m256d v_sum1 = _mm256_maskload_pd(&lhs_val[(k + r) * col_stride + c + 4],mask);
			__m256d v_val0 = _mm256_set_pd(a_val0,a_val0,a_val0,a_val0);
			__m256d v_val1 = _mm256_set_pd(a_val1,a_val1,a_val1,a_val1);
            __m256d val_b00 = _mm256_loadu_pd(&val_ptr1[0]);
			__m256d val_b10 = _mm256_loadu_pd(&val_ptr1[block1_size]);
			
			val_b01 = _mm256_maskload_pd(&val_ptr1[4],mask);
			val_b11 = _mm256_maskload_pd(&val_ptr1[block1_size + 4],mask);
			
			v_sum0 = _mm256_fmadd_pd(v_val0,val_b00,v_sum0);
			v_sum0 = _mm256_fmadd_pd(v_val1,val_b10,v_sum0);
			
			v_sum1 = _mm256_fmadd_pd(v_val0,val_b01,v_sum1);
			v_sum1 = _mm256_fmadd_pd(v_val1,val_b11,v_sum1);
			double *v_sum0_ptr = (double *)&v_sum0;
			double *v_sum1_ptr = (double *)&v_sum1;
			_mm256_storeu_pd(lhs_val_ptr,v_sum0);
			lhs_val_ptr[4] = v_sum1_ptr[0];
			lhs_val_ptr[5] = v_sum1_ptr[1];
		  	
	      }
	      
			block1 = row.cells[2].block_id - num_eliminate_blocks_;
			block1_size = bs->cols[row.cells[2].block_id].size;

			r = block_layout_[block1];
			c = block_layout_[block1];

			val_ptr = (double *)(values + row.cells[2].position);
					 
			for(int k = 0;k < block1_size;k++)
			{
				double *lhs_val_ptr = &lhs_val[(k + r) * col_stride + 0 + c];
			    double a_val0 = val_ptr[0 * block1_size  + k];
				double a_val1 = val_ptr[1 * block1_size  + k];
				__m256d v_sum0 = _mm256_loadu_pd(&lhs_val[(k + r) * col_stride + c]);
				__m256d v_sum1 = _mm256_maskload_pd(&lhs_val[(k + r) * col_stride + c + 4],mask);
				__m256d v_val0 = _mm256_set_pd(a_val0,a_val0,a_val0,a_val0);
				__m256d v_val1 = _mm256_set_pd(a_val1,a_val1,a_val1,a_val1);
			    __m256d val_b00 = _mm256_loadu_pd(&val_ptr[0]);
				__m256d val_b10 = _mm256_loadu_pd(&val_ptr[block1_size]);
				
				val_b01 = _mm256_maskload_pd(&val_ptr[4],mask);
				val_b11 = _mm256_maskload_pd(&val_ptr[block1_size + 4],mask);
				
				v_sum0 = _mm256_fmadd_pd(v_val0,val_b00,v_sum0);
				v_sum0 = _mm256_fmadd_pd(v_val1,val_b10,v_sum0);
				
				v_sum1 = _mm256_fmadd_pd(v_val0,val_b01,v_sum1);
				v_sum1 = _mm256_fmadd_pd(v_val1,val_b11,v_sum1);
				double *v_sum0_ptr = (double *)&v_sum0;
				double *v_sum1_ptr = (double *)&v_sum1;
				_mm256_storeu_pd(lhs_val_ptr,v_sum0);
	
				lhs_val_ptr[4] = v_sum1_ptr[0];
				lhs_val_ptr[5] = v_sum1_ptr[1];
			    
			}
			
		  
      }
  }

Some of the loops here have fixed trip counts (1 or 6), so they are fully unrolled and computed with AVX instructions for efficiency.
Inside the big loop, each parameter block also needs e'e and its inverse; since the e-block here is 1-dimensional, the inverse is simply the reciprocal, and each group's values are kept in a temporary buffer. The code:

 const CompressedRowBlockStructure* bs = A.block_structure();
		  const double* values = A.values();

		  int b_pos = bs->rows[chunk.start].block.position;

		  // Iterate over the rows in this chunk, for each row, compute the
		  // contribution of its F blocks to the Schur complement, the
		  // contribution of its E block to the matrix EE' (ete), and the
		  // corresponding block in the gradient vector.
		  double mask_buf0[4] = {-1,-2,-3,-4};
		  double mask_buf1[4] = {1,2,3,4};

		  int mask_num = (chunk.size & 0x3);
		  __m256d mask0 = _mm256_loadu_pd(mask_buf0);
		  __m256d mask1 = _mm256_loadu_pd(mask_buf1);
		  
		  for(int j = 0;j < mask_num;j++)
		  {
		  	mask_buf0[j] = mask_buf1[j];
		  }
		  __m256d mask = _mm256_loadu_pd(mask_buf0);
	 
		  __m256d zero = _mm256_set1_pd(0.0);
	
		  double ete_t = ete,g_t = g;
		  double zero_t = 0.0;
		  __m256d vSum0 = _mm256_broadcast_sd(&zero_t);
		  __m256d vSum1 = _mm256_broadcast_sd(&zero_t);
		  __m256d vgSum0 = _mm256_broadcast_sd(&zero_t);
		  __m256d vgSum1 = _mm256_broadcast_sd(&zero_t);
	
		  __m256i vindex0,vindex1;

		  vindex0 = _mm256_set_epi64x(24,16,8,0);
		  vindex1 = _mm256_set_epi64x(28,20,12,4);
		  double ete_tt = ete_t;
		  double g_tt = 0;
		 for (int j = 0; j < chunk.size; j += 4)
		  {
			// clang-format off
			/*
			MatrixTransposeMatrixMultiply
			<kRowBlockSize, kEBlockSize, kRowBlockSize, kEBlockSize, 1>(
			    values + e_cell.position, row.block.size, e_block_size,
			    values + e_cell.position, row.block.size, e_block_size,
			    ete.data(), 0, 0, e_block_size, e_block_size);
			*/       
			
			double *val_ptr = (double *)(values + idx * 2);
			int num = 4;
			if((j + 4) > chunk.size)
			{
				num = chunk.size & 0x3;
				mask1 = mask;
			}

			__m256d v_val0 =  _mm256_i64gather_pd (val_ptr, vindex0, 2);
			__m256d v_val1 =  _mm256_i64gather_pd (val_ptr, vindex1, 2);
			
		    v_val0 = _mm256_blendv_pd(v_val0,zero,mask1);
			v_val1 = _mm256_blendv_pd(v_val1,zero,mask1);
	        vSum0 = _mm256_fmadd_pd(v_val0,v_val0,vSum0);
			vSum1 = _mm256_fmadd_pd(v_val1,v_val1,vSum1);

			// clang-format on
			//const CompressedRow& row = bs->rows[chunk.start + j];
			// Extract the e_block, ETE += E_i' E_i
			//const Cell& e_cell = row.cells.front();
			// clang-format off  

			// g += E_i' b_i
			// clang-format off
			double * b_ptr = (double*)(b + idx * 2);
			__m256d v_val_b_0 =  _mm256_i64gather_pd (b_ptr, vindex0, 2);
			__m256d v_val_b_1 =  _mm256_i64gather_pd (b_ptr, vindex1, 2);
	
	        vgSum0 = _mm256_fmadd_pd(v_val0,v_val_b_0,vgSum0);
			vgSum1 = _mm256_fmadd_pd(v_val1,v_val_b_1,vgSum1);
			
            idx += num;
	
		  }
			ete_tt += vSum0[0] + vSum0[1] + vSum0[2] + vSum0[3];
			ete_tt += vSum1[0] + vSum1[1] + vSum1[2] + vSum1[3];
			g_tt += vgSum0[0] + vgSum0[1] + vgSum0[2] + vgSum0[3];
			g_tt += vgSum1[0] + vgSum1[1] + vgSum1[2] + vgSum1[3];
			ete = ete_tt;
			g = g_tt;	
			// Normally one wouldn't compute the inverse explicitly, but
			// e_block_size will typically be a small number like 3, in
			// which case its much faster to compute the inverse once and
			// use it to multiply other matrices/vectors instead of doing a
			// Solve call over and over again.
	        double inverse_ete = (double)1/ete;
			// For the current chunk compute and update the rhs of the reduced
			// linear system.
			//
			//	 rhs = F'b - F'E(E'E)^(-1) E'b
			  double  inverse_ete_g = inverse_ete * g;

			 //UpdateRhs(chunk, A, b, chunk.start, inverse_ete_g.data(), rhs);
		     ete_inv[i * e_block_size * e_block_size] = inverse_ete;
			 ete_inv_g[i * e_block_size] = inverse_ete_g;
			
	  }
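Since the landmark (E) blocks are 1x1 in this problem, the gathered/FMA reduction above collapses to two scalar sums per chunk. A plain-C++ reference of the same computation, handy for validating the vectorized version (struct and function names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Scalar reference for the chunk reduction: with 1x1 E blocks,
// ete = sum_i e_i^2 and g = sum_i e_i * b_i, and "inverting" the
// 1x1 matrix E'E is just taking a reciprocal.
struct ChunkEte {
  double inverse_ete;    // (E'E)^{-1}
  double inverse_ete_g;  // (E'E)^{-1} E'b
};

ChunkEte reduce_chunk(const double* e, const double* b, int n) {
  double ete = 0.0, g = 0.0;
  for (int i = 0; i < n; ++i) {
    ete += e[i] * e[i];
    g += e[i] * b[i];
  }
  ChunkEte out;
  out.inverse_ete = 1.0 / ete;
  out.inverse_ete_g = out.inverse_ete * g;
  return out;
}
```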

UpdateRhs() computes the following expression:
F'b - F'E(E'E)^(-1)E'b
Its internal code is:


// Update the rhs of the reduced linear system. Compute
//
//   F'b - F'E(E'E)^(-1) E'b
template <int kRowBlockSize, int kEBlockSize, int kFBlockSize>
void SchurEliminator<kRowBlockSize, kEBlockSize, kFBlockSize>::UpdateRhs(
    const Chunk& chunk,
    const BlockSparseMatrixData& A,
    const double* b,
    int row_block_counter,
    const double* inverse_ete_g,
    double* rhs) {
  const CompressedRowBlockStructure* bs = A.block_structure();
  const double* values = A.values();

  const int e_block_id = bs->rows[chunk.start].cells.front().block_id;
  const int e_block_size = bs->cols[e_block_id].size;
  //LOG(INFO) << "UpdateRhs " << e_block_size;
  int b_pos = bs->rows[row_block_counter].block.position;
  for (int j = 0; j < chunk.size; ++j) {
    const CompressedRow& row = bs->rows[row_block_counter + j];
    const Cell& e_cell = row.cells.front();

    typename EigenTypes<kRowBlockSize>::Vector sj =
        typename EigenTypes<kRowBlockSize>::ConstVectorRef(b + b_pos,
                                                           row.block.size);

    // clang-format off
    MatrixVectorMultiply<kRowBlockSize, kEBlockSize, -1>(
        values + e_cell.position, row.block.size, e_block_size,
        inverse_ete_g, sj.data());
    // clang-format on
    
    for (int c = 1; c < row.cells.size(); ++c) {
      const int block_id = row.cells[c].block_id;
      const int block_size = bs->cols[block_id].size;
      const int block = block_id - num_eliminate_blocks_;
      std::lock_guard<std::mutex> l(*rhs_locks_[block]);
      // clang-format off
      MatrixTransposeVectorMultiply<kRowBlockSize, kFBlockSize, 1>(
          values + row.cells[c].position,
          row.block.size, block_size,
          sj.data(), rhs + lhs_row_layout_[block]);
	 // double *rhs_ptr = rhs + lhs_row_layout_[block];
	  //if(calc_idx > 20 && calc_idx < 30 && param_idx < 5)
			//LOG(INFO) << "Eliminate " << lhs_row_layout_[block];
      // clang-format on
    }
    b_pos += row.block.size;
  }
}

Because the Jacobian block dimensions are fixed, the inner loops here can likewise be flattened and rewritten with AVX instructions:

int b_pos = 0;
	  for(int i = 0;i < chunks_.size();i++) 
	  {
          const Chunk& chunk = chunks_[i];
		  FixedArray<double, 8> inverse_ete_g(1);
		  
		  inverse_ete_g.data()[0] = ete_inv_g[i];
		  double inverse_ete_eb = ete_inv_g[i];
		  double zero_t = 0.0;
		  __m256d v_zero = _mm256_broadcast_sd(&zero_t);
		  __m256d mask = _mm256_set_pd(-1,-1,2,2);
		  __m256i maski64 = _mm256_set_epi64x(-1,-1,2,2);
		  for (int j = 0; j < chunk.size; ++j) 
		  {
		    const CompressedRow& row = bs->rows[chunk.start + j];
	
		    typename EigenTypes<kRowBlockSize>::Vector sj =
		        typename EigenTypes<kRowBlockSize>::ConstVectorRef(b + b_pos,2);

		    // clang-format off
		    //MatrixVectorMultiply<kRowBlockSize, kEBlockSize, -1>(
		       // values + e_cell.position, row.block.size, e_block_size,
		       // inverse_ete_g.data(), sj.data());
			double *val_ptr0 = (double *)(values + b_pos);
			double *val_ptr1 = (double *)(values + b_pos + 1);
		    
		    sj.data()[0] -= val_ptr0[0] * inverse_ete_eb;
		    sj.data()[1] -= val_ptr1[0] * inverse_ete_eb;
			
		    // clang-format on
		    int block_id = row.cells[1].block_id;
		    int block_size = bs->cols[block_id].size;
		    int block = block_id - num_eliminate_blocks_;
		  
		    // clang-format off
			double sj_0 = sj.data()[0];
			double sj_1 = sj.data()[1];
			  
	      // clang-format off
	        
	        double *val_ptr = (double *)(values + row.cells[1].position);
			double *sj_ptr = sj.data();
			double *rhs_ptr = rhs + lhs_row_layout_[block];
	   
		   
			__m256d v_sj_0 = _mm256_broadcast_sd(&sj_0);
			__m256d v_sj_1 = _mm256_broadcast_sd(&sj_1);
			
			__m256d val_00 = _mm256_loadu_pd(val_ptr);
		   __m256d val_01 = _mm256_loadu_pd(&val_ptr[block_size]);
		   __m256d rhs_val0 = _mm256_loadu_pd(rhs_ptr);
		   __m256d val_10 = _mm256_loadu_pd(&val_ptr[4]);
		   __m256d val_11 = _mm256_loadu_pd(&val_ptr[block_size + 4]);
		   __m256d rhs_val1 = _mm256_loadu_pd(&rhs_ptr[4]);

		    val_10 = _mm256_blendv_pd(val_10,v_zero,mask);
			val_11 = _mm256_blendv_pd(val_11,v_zero,mask);
			
			rhs_val0 = _mm256_fmadd_pd(val_00,v_sj_0,rhs_val0);
			rhs_val0 = _mm256_fmadd_pd(val_01,v_sj_1,rhs_val0);
			

			rhs_val1 = _mm256_fmadd_pd(val_10,v_sj_0,rhs_val1);
			rhs_val1 = _mm256_fmadd_pd(val_11,v_sj_1,rhs_val1);
	
		    
			 _mm256_storeu_pd(&rhs_ptr[0],rhs_val0);
			 rhs_ptr[4] = rhs_val1[0];
			 rhs_ptr[5] = rhs_val1[1];

			 block_id = row.cells[2].block_id;
	         block_size = bs->cols[block_id].size;
	         block = block_id - num_eliminate_blocks_;
	      // clang-format off
			val_ptr = (double *)(values + row.cells[2].position);
			sj_ptr = sj.data();
			rhs_ptr = rhs + lhs_row_layout_[block];
			val_00 = _mm256_loadu_pd(val_ptr);
			val_01 = _mm256_loadu_pd(&val_ptr[block_size]);
			rhs_val0 = _mm256_loadu_pd(rhs_ptr);
			val_10 = _mm256_loadu_pd(&val_ptr[4]);
			val_11 = _mm256_loadu_pd(&val_ptr[block_size + 4]);
			rhs_val1 = _mm256_loadu_pd(&rhs_ptr[4]);

			val_10 = _mm256_blendv_pd(val_10,v_zero,mask);
			val_11 = _mm256_blendv_pd(val_11,v_zero,mask);

			rhs_val0 = _mm256_fmadd_pd(val_00,v_sj_0,rhs_val0);
			rhs_val0 = _mm256_fmadd_pd(val_01,v_sj_1,rhs_val0);


			rhs_val1 = _mm256_fmadd_pd(val_10,v_sj_0,rhs_val1);
			rhs_val1 = _mm256_fmadd_pd(val_11,v_sj_1,rhs_val1);
			// clang-format on
			_mm256_storeu_pd(&rhs_ptr[0],rhs_val0);
			rhs_ptr[4] = rhs_val1[0];
			rhs_ptr[5] = rhs_val1[1];
		    b_pos += 2;
		  }
	  }
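As a check on the vectorized rhs update above, here is a scalar reference for the 1-d e-block case: per chunk row, s_j = b_j - E_j (E'E)^{-1}E'b, then rhs += F_j' s_j. The row height 2 and f-block width 6 match the fixed sizes used throughout; the function name is illustrative:

```cpp
#include <cassert>

// Scalar reference for one row of the rhs update:
// rhs += F' (b - E (E'E)^{-1} E'b), with e a 2x1 block,
// F a 2x6 row-major block, and b a 2-vector.
void update_rhs_row(const double* e, const double* F, const double* b,
                    double inverse_ete_g, double* rhs) {
  double s[2];
  for (int i = 0; i < 2; ++i)
    s[i] = b[i] - e[i] * inverse_ete_g;  // s = b - E (E'E)^{-1} E'b
  for (int k = 0; k < 6; ++k)            // rhs += F' s
    for (int i = 0; i < 2; ++i)
      rhs[k] += F[i * 6 + k] * s[i];
}
```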

The next computation step needs the E'F matrix, so a separate loop is written here to compute this part. The code is as follows:

		for (int j = 0; j < chunk.size; ++j)
		{
		  	  const CompressedRow& row = bs->rows[chunk.start + j];
			  const Cell& e_cell = row.cells.front();
		  	// buffer = E'F. This computation is done by iterating over the
		    // f_blocks for each row in the chunk.

		      int f_block_id = row.cells[1].block_id;
		      int f_block_size = bs->cols[f_block_id].size;

		      double* buffer_ptr = buffer;
			  // clang-format off
			  double *val_ptr0 = (double *)(values + e_cell.position);
			  double *val_ptr1 = (double *)(values + row.cells[1].position);
              
			  __m256d v_val0 =  _mm256_loadu_pd(val_ptr1);
		      __m256d v_val1 =  _mm256_loadu_pd(&val_ptr1[4]);
	          __m256d v_sum0 =  _mm256_loadu_pd(buffer_ptr);
			  __m256d v_sum1 =  _mm256_loadu_pd(&buffer_ptr[4]);
			  
			  __m256d v_val2 =  _mm256_loadu_pd(&val_ptr1[6]);
		      __m256d v_val3 =  _mm256_loadu_pd(&val_ptr1[10]);
			  
			  __m256d v_cel_0 = _mm256_broadcast_sd(&val_ptr0[0]);
		      __m256d v_cel_1 = _mm256_broadcast_sd(&val_ptr0[1]);
             
			  v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
			  v_val3 = _mm256_blendv_pd(v_val3,v_zero,mask);
		      v_sum1 = _mm256_blendv_pd(v_sum1,v_zero,mask);
			  
			  v_sum0 = _mm256_fmadd_pd(v_cel_0,v_val0,v_sum0);
			  v_sum0 = _mm256_fmadd_pd(v_cel_1,v_val2,v_sum0);
			
              v_sum1 = _mm256_fmadd_pd(v_cel_0,v_val1,v_sum1);
			  v_sum1 = _mm256_fmadd_pd(v_cel_1,v_val3,v_sum1);
			
			  _mm256_storeu_pd(&buffer_ptr[0],v_sum0);
			  buffer_ptr[4] = v_sum1[0];
			  buffer_ptr[5] = v_sum1[1];


			  f_block_id = row.cells[2].block_id;
		      f_block_size = bs->cols[f_block_id].size;
		      buffer_ptr = buffer + FindOrDie(chunk.buffer_layout, f_block_id);
			  // clang-format off
			  val_ptr0 = (double *)(values + e_cell.position);
			  val_ptr1 = (double *)(values + row.cells[2].position);

			  v_val0 =  _mm256_loadu_pd(val_ptr1);
		      v_val1 =  _mm256_loadu_pd(&val_ptr1[4]);
	          v_sum0 =  _mm256_loadu_pd(buffer_ptr);
			  v_sum1 =  _mm256_loadu_pd(&buffer_ptr[4]);
			  v_val2 =  _mm256_loadu_pd(&val_ptr1[6]);
		      v_val3 =  _mm256_loadu_pd(&val_ptr1[10]);
		      v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
			  v_cel_0 = _mm256_broadcast_sd(&val_ptr0[0]);
		      v_cel_1 = _mm256_broadcast_sd(&val_ptr0[1]);

			  v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
			  v_val3 = _mm256_blendv_pd(v_val3,v_zero,mask);
			  v_sum1 = _mm256_blendv_pd(v_sum1,v_zero,mask);
   
			  v_sum0 = _mm256_fmadd_pd(v_cel_0,v_val0,v_sum0);
			  v_sum0 = _mm256_fmadd_pd(v_cel_1,v_val2,v_sum0);
			
              v_sum1 = _mm256_fmadd_pd(v_cel_0,v_val1,v_sum1);
			  v_sum1 = _mm256_fmadd_pd(v_cel_1,v_val3,v_sum1);
              _mm256_storeu_pd(&buffer_ptr[0],v_sum0);
			  buffer_ptr[4] = v_sum1[0];
			  buffer_ptr[5] = v_sum1[1];
	
		      // clang-format on
		  }
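The loop above accumulates buffer = E'F for each f-block in the chunk. With a 1x1 e-block (so E_i is a 2x1 column) and a 2x6 f-block, each contribution is a 1x6 row. A scalar reference sketch (illustrative name, row-major layout assumed):

```cpp
#include <cassert>

// Scalar reference for buffer += E_i' F_i: e holds the 2x1 e-block
// column, F is a 2x6 row-major f-block, and the 1x6 result is
// accumulated into the f-block's slot of the chunk buffer.
void accumulate_etf(const double* e, const double* F, double* buffer) {
  for (int k = 0; k < 6; ++k)
    for (int i = 0; i < 2; ++i)
      buffer[k] += e[i] * F[i * 6 + k];
}
```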

To reduce the amount of computation, F' is factored out, transforming the expression into:
F'(b - E(E'E)^(-1)E'b)
ChunkOuterProduct() computes:
S -= F'E(E'E)^{-1}E'F
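Written out, these pieces are the standard Schur elimination of the landmark (E) blocks from the normal equations, in the document's E/F notation:

```latex
\begin{aligned}
\begin{bmatrix} E^\top E & E^\top F \\ F^\top E & F^\top F \end{bmatrix}
\begin{bmatrix} \Delta y \\ \Delta z \end{bmatrix}
&=
\begin{bmatrix} E^\top b \\ F^\top b \end{bmatrix} \\
S &= F^\top F - F^\top E\,(E^\top E)^{-1}E^\top F \\
r &= F^\top b - F^\top E\,(E^\top E)^{-1}E^\top b
   = F^\top\!\left(b - E\,(E^\top E)^{-1}E^\top b\right) \\
S\,\Delta z &= r, \qquad
\Delta y = (E^\top E)^{-1}\left(E^\top b - E^\top F\,\Delta z\right)
\end{aligned}
```

Here \(\Delta y\) are the eliminated landmark parameters and \(\Delta z\) the remaining pose/speed-bias parameters.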
Its internal code is:


// Compute the outer product F'E(E'E)^{-1}E'F and subtract it from the
// Schur complement matrix, i.e
//
//  S -= F'E(E'E)^{-1}E'F.
template <int kRowBlockSize, int kEBlockSize, int kFBlockSize>
void SchurEliminator<kRowBlockSize, kEBlockSize, kFBlockSize>::
    ChunkOuterProduct(int thread_id,
                      const CompressedRowBlockStructure* bs,
                      const Matrix& inverse_ete,
                      const double* buffer,
                      const BufferLayoutType& buffer_layout,
                      BlockRandomAccessMatrix* lhs) {
  // This is the most computationally expensive part of this
  // code. Profiling experiments reveal that the bottleneck is not the
  // computation of the right-hand matrix product, but memory
  // references to the left hand side.
  const int e_block_size = inverse_ete.rows();
  BufferLayoutType::const_iterator it1 = buffer_layout.begin();

  double* b1_transpose_inverse_ete =
      chunk_outer_product_buffer_.get() + thread_id * buffer_size_;

  // S(i,j) -= bi' * ete^{-1} b_j
  for (; it1 != buffer_layout.end(); ++it1) {
    const int block1 = it1->first - num_eliminate_blocks_;
    const int block1_size = bs->cols[it1->first].size;
    // clang-format off
    MatrixTransposeMatrixMultiply
        <kEBlockSize, kFBlockSize, kEBlockSize, kEBlockSize, 0>(
        buffer + it1->second, e_block_size, block1_size,
        inverse_ete.data(), e_block_size, e_block_size,
        b1_transpose_inverse_ete, 0, 0, block1_size, e_block_size);
    // clang-format on

    BufferLayoutType::const_iterator it2 = it1;
    for (; it2 != buffer_layout.end(); ++it2) {
      const int block2 = it2->first - num_eliminate_blocks_;

      int r, c, row_stride, col_stride;
      CellInfo* cell_info =
          lhs->GetCell(block1, block2, &r, &c, &row_stride, &col_stride);
	  
      if (cell_info != NULL) {
        const int block2_size = bs->cols[it2->first].size;

	    //LOG(INFO) << " " << block1_size << " " << block2_size;
        std::lock_guard<std::mutex> l(cell_info->m);
        // clang-format off
        MatrixMatrixMultiply
            <kFBlockSize, kEBlockSize, kEBlockSize, kFBlockSize, -1>(
                b1_transpose_inverse_ete, block1_size, e_block_size,
                buffer  + it2->second, e_block_size, block2_size,
                cell_info->values, r, c, row_stride, col_stride);
        // clang-format on
      }
    }
  }
}

After rewriting this part with AVX intrinsics, the code becomes:

      for(int i = 0;i < chunks_.size();i++) 
	  {
	  	    double* buffer = buffer_.get();
			const Chunk& chunk = chunks_[i];
			VectorRef(buffer, buffer_size_).setZero();
		    
		    const CompressedRowBlockStructure* bs = A.block_structure();
		    const double* values = A.values();
		    typename EigenTypes<kEBlockSize, kEBlockSize>::Matrix inverse_ete(1,1);
			inverse_ete.data()[0] = ete_inv[i];
			__m256i vindex0 = _mm256_set_epi64x(24,16,8,0);
		    __m256i vindex1 = _mm256_set_epi64x(28,20,12,4);
			double zero_t = 0.0;
		    __m256d v_zero = _mm256_broadcast_sd(&zero_t);
		    __m256d mask = _mm256_set_pd(-1,-1,2,2);
            
			for (int j = 0; j < chunk.size; ++j)
			{
			  	  const CompressedRow& row = bs->rows[chunk.start + j];
				  const Cell& e_cell = row.cells.front();
			  	// buffer = E'F. This computation is done by iterating over the
			    // f_blocks for each row in the chunk.

			      int f_block_id = row.cells[1].block_id;
			      int f_block_size = bs->cols[f_block_id].size;
	
			      double* buffer_ptr = buffer;
				  // clang-format off
				  double *val_ptr0 = (double *)(values + e_cell.position);
				  double *val_ptr1 = (double *)(values + row.cells[1].position);
                  
				  __m256d v_val0 =  _mm256_loadu_pd(val_ptr1);
			      __m256d v_val1 =  _mm256_loadu_pd(&val_ptr1[4]);
		          __m256d v_sum0 =  _mm256_loadu_pd(buffer_ptr);
				  __m256d v_sum1 =  _mm256_loadu_pd(&buffer_ptr[4]);
				  
				  __m256d v_val2 =  _mm256_loadu_pd(&val_ptr1[6]);
			      __m256d v_val3 =  _mm256_loadu_pd(&val_ptr1[10]);
				  
				  __m256d v_cel_0 = _mm256_broadcast_sd(&val_ptr0[0]);
			      __m256d v_cel_1 = _mm256_broadcast_sd(&val_ptr0[1]);
                 
				  v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
				  v_val3 = _mm256_blendv_pd(v_val3,v_zero,mask);
			      v_sum1 = _mm256_blendv_pd(v_sum1,v_zero,mask);
				  
				  v_sum0 = _mm256_fmadd_pd(v_cel_0,v_val0,v_sum0);
				  v_sum0 = _mm256_fmadd_pd(v_cel_1,v_val2,v_sum0);
				
	              v_sum1 = _mm256_fmadd_pd(v_cel_0,v_val1,v_sum1);
				  v_sum1 = _mm256_fmadd_pd(v_cel_1,v_val3,v_sum1);
				
				  _mm256_storeu_pd(&buffer_ptr[0],v_sum0);
				  buffer_ptr[4] = v_sum1[0];
				  buffer_ptr[5] = v_sum1[1];
	
	
				  f_block_id = row.cells[2].block_id;
			      f_block_size = bs->cols[f_block_id].size;
			      buffer_ptr = buffer + FindOrDie(chunk.buffer_layout, f_block_id);
				  // clang-format off
				  val_ptr0 = (double *)(values + e_cell.position);
				  val_ptr1 = (double *)(values + row.cells[2].position);
	
				  v_val0 =  _mm256_loadu_pd(val_ptr1);
			      v_val1 =  _mm256_loadu_pd(&val_ptr1[4]);
		          v_sum0 =  _mm256_loadu_pd(buffer_ptr);
				  v_sum1 =  _mm256_loadu_pd(&buffer_ptr[4]);
				  v_val2 =  _mm256_loadu_pd(&val_ptr1[6]);
			      v_val3 =  _mm256_loadu_pd(&val_ptr1[10]);
			      v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
				  v_cel_0 = _mm256_broadcast_sd(&val_ptr0[0]);
			      v_cel_1 = _mm256_broadcast_sd(&val_ptr0[1]);

				  v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
				  v_val3 = _mm256_blendv_pd(v_val3,v_zero,mask);
				  v_sum1 = _mm256_blendv_pd(v_sum1,v_zero,mask);
       
				  v_sum0 = _mm256_fmadd_pd(v_cel_0,v_val0,v_sum0);
				  v_sum0 = _mm256_fmadd_pd(v_cel_1,v_val2,v_sum0);
				
	              v_sum1 = _mm256_fmadd_pd(v_cel_0,v_val1,v_sum1);
				  v_sum1 = _mm256_fmadd_pd(v_cel_1,v_val3,v_sum1);
                  _mm256_storeu_pd(&buffer_ptr[0],v_sum0);
				  buffer_ptr[4] = v_sum1[0];
				  buffer_ptr[5] = v_sum1[1];
		
			      // clang-format on
			  }
	        param_idx++;
			// S -= F'E(E'E)^{-1}E'F
			//ChunkOuterProduct(
				//0, bs, inverse_ete, buffer, chunk.buffer_layout, lhs);
		  const int e_block_size = 1;
		  BufferLayoutType::const_iterator it1 = chunk.buffer_layout.begin();

		  double* b1_transpose_inverse_ete = chunk_outer_product_buffer_.get();
          //CellInfo* cell_info =
			         // lhs->GetCell(0, 0, &r, &c, &row_stride, &col_stride);
		  //double *lhs_val =  cell_info->values;
		  // S(i,j) -= bi' * ete^{-1} b_j
		  int r, c, row_stride, col_stride;
		  CellInfo* cell_info =
			          lhs->GetCell(0, 0, &r, &c, &row_stride, &col_stride);
		  double *lhs_val =  cell_info->values;

	
		  v_zero = _mm256_broadcast_sd(&zero_t);
		  mask = _mm256_set_pd(-1,-1,2,2);
		  for (; it1 != chunk.buffer_layout.end(); ++it1) 
		  {
		    const int block1 = it1->first - num_eliminate_blocks_;

			double *buffer_ptr = buffer + it1->second;
		    // clang-format on
		    __m256d vinverse_ete = _mm256_broadcast_sd(inverse_ete.data());
			__m256d v_val0 =  _mm256_loadu_pd(buffer_ptr);
			__m256d v_val1 =  _mm256_loadu_pd(&buffer_ptr[4]);
			v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
			__m256d btete0 = _mm256_mul_pd(v_val0,vinverse_ete);
			__m256d btete1 = _mm256_mul_pd(v_val1,vinverse_ete);
			_mm256_storeu_pd(&b1_transpose_inverse_ete[0],btete0);
			b1_transpose_inverse_ete[4] = btete1[0];
			b1_transpose_inverse_ete[5] = btete1[1];
     
		    BufferLayoutType::const_iterator it2 = it1;
		    for (; it2 != chunk.buffer_layout.end(); ++it2) 
			{
		      const int block2 = it2->first - num_eliminate_blocks_;
			  
              //cell_info = lhs->GetCell(block1, block2, &r, &c, &row_stride, &col_stride);
			  r = block_layout_[block1];
 			  c = block_layout_[block2];
			  buffer_ptr = buffer + it2->second;
			  double *lhs_val_ptr = &lhs_val[r * col_stride + c];
	      
	          for(int j = 0;j < 6;j++)
	          {
			  	__m256d vb1_inverse_ete = _mm256_broadcast_sd(&b1_transpose_inverse_ete[j]);
				 v_val0 =  _mm256_loadu_pd(buffer_ptr);
			     v_val1 =  _mm256_loadu_pd(&buffer_ptr[4]);
				 v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
				 __m256d res0 = _mm256_loadu_pd(&lhs_val_ptr[j * col_stride + 0]);
				 __m256d res1 = _mm256_loadu_pd(&lhs_val_ptr[j * col_stride + 4]);
				 __m256d btete0 = _mm256_mul_pd(vb1_inverse_ete,v_val0);
			     __m256d btete1 = _mm256_mul_pd(vb1_inverse_ete,v_val1);
				 res0 = _mm256_sub_pd(res0,btete0);
				 res1 = _mm256_sub_pd(res1,btete1);
				 _mm256_storeu_pd(&lhs_val_ptr[j * col_stride + 0],res0);
				 lhs_val_ptr[j * col_stride + 4] = res1[0];
				 lhs_val_ptr[j * col_stride + 5] = res1[1];
	          }
		      
		    }
		  }
		  
		  }
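Because the e-blocks are 1x1 here, the outer product S -= F'E(E'E)^{-1}E'F in the code above degenerates to a rank-1 update between pairs of 1x6 rows stored in the chunk buffer. A scalar reference of one such cell update (illustrative names, row-major cell storage assumed):

```cpp
#include <cassert>

// Scalar reference for the chunk outer product: given two 1x6 rows
// u = E'F_1 and v = E'F_2 from the chunk buffer, subtract the
// rank-1 term u' * (E'E)^{-1} * v from a 6x6 cell of S.
void chunk_outer_product(const double* u, const double* v,
                         double inverse_ete, double* cell,
                         int col_stride) {
  for (int j = 0; j < 6; ++j) {
    double uj = u[j] * inverse_ete;  // b1' * (E'E)^{-1}
    for (int k = 0; k < 6; ++k)
      cell[j * col_stride + k] -= uj * v[k];
  }
}
```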

Once the subtraction is complete, i.e. after the loop finishes, NoEBlockRowsUpdate() is called to compute S += F'F. This part takes comparatively little time and can be ignored for now.
The listings above are long, and for clarity of the overall flow they are not the complete code. If you need the full optimized code, you can add me on WeChat (yhtao923) and we can discuss it.
