How do you confirm that the compiler has SIMD-vectorized your code? I learned the basic procedure from the video above.

A basic example

Here is an example. The code:

  1 // Copyright 1999-2023 Alibaba Inc. All Rights Reserved.
  2 // Author:
  3 //   xiaochu.yh@alipay
  4 //
  5
  6 #include <stdio.h>
  7
  8 int main(int argc, const char *argv[])
  9 {
 10   int j = 0;
 11   for (int i = 0; i < 1024; ++i) {
 12     j += i;
 13   }
 14   printf("j = %d\n", j);
 15   return 0;
 16 }

Compile with every vectorization report flag turned on:

[xiaochu.yh ~/tools/vector] $g++ vector.cpp -fopt-info-vec-optimized -fopt-info-vec-missed -fopt-info-vec-note -fopt-info-vec-all
[xiaochu.yh ~/tools/vector] $./a.out
j = 523776

No vectorization report at all? It turns out the -O3 optimization flag was missing. With -O3 added, compiling again prints the following.

[xiaochu.yh ~/tools/vector] $g++ vector.cpp -O3 -fopt-info-vec-optimized -fopt-info-vec-missed -fopt-info-vec-note -fopt-info-vec-all

Analyzing loop at vector.cpp:11

vector.cpp:11: note: ===== analyze_loop_nest =====
vector.cpp:11: note: === vect_analyze_loop_form ===
vector.cpp:11: note: === get_loop_niters ===
vector.cpp:11: note: ==> get_loop_niters:1024
vector.cpp:11: note: === vect_analyze_data_refs ===

vector.cpp:11: note: === vect_analyze_scalar_cycles ===
vector.cpp:11: note: Analyze phi: j_10 = PHI <j_3(4), 0(2)>

vector.cpp:11: note: Access function of PHI: {0, +, i_11}_1
vector.cpp:11: note: step: i_11,  init: 0
vector.cpp:11: note: step unknown.
vector.cpp:11: note: Analyze phi: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: Access function of PHI: {0, +, 1}_1
vector.cpp:11: note: step: 1,  init: 0
vector.cpp:11: note: Detected induction.
vector.cpp:11: note: Analyze phi: ivtmp_2 = PHI <ivtmp_1(4), 1024(2)>

vector.cpp:11: note: Access function of PHI: {1024, +, 4294967295}_1
vector.cpp:11: note: step: 4294967295,  init: 1024
vector.cpp:11: note: Detected induction.
vector.cpp:11: note: Analyze phi: j_10 = PHI <j_3(4), 0(2)>

vector.cpp:11: note: detected reduction: need to swap operands: j_3 = j_10 + i_11;

vector.cpp:11: note: Detected reduction.
vector.cpp:11: note: === vect_pattern_recog ===
vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: === vect_mark_stmts_to_be_vectorized ===
vector.cpp:11: note: init: phi relevant? j_10 = PHI <j_3(4), 0(2)>

vector.cpp:11: note: init: phi relevant? i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: init: phi relevant? ivtmp_2 = PHI <ivtmp_1(4), 1024(2)>

vector.cpp:11: note: init: stmt relevant? j_3 = i_11 + j_10;

vector.cpp:11: note: vec_stmt_relevant_p: used out of loop.
vector.cpp:11: note: mark relevant 0, live 1.
vector.cpp:11: note: init: stmt relevant? i_4 = i_11 + 1;

vector.cpp:11: note: init: stmt relevant? ivtmp_1 = ivtmp_2 - 1;

vector.cpp:11: note: init: stmt relevant? if (ivtmp_1 != 0)

vector.cpp:11: note: worklist: examine stmt: j_3 = i_11 + j_10;

vector.cpp:11: note: vect_is_simple_use: operand j_10
vector.cpp:11: note: def_stmt: j_10 = PHI <j_3(4), 0(2)>

vector.cpp:11: note: type of def: 5.
vector.cpp:11: note: mark relevant 3, live 0.
vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: mark relevant 3, live 0.
vector.cpp:11: note: worklist: examine stmt: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: vect_is_simple_use: operand i_4
vector.cpp:11: note: def_stmt: i_4 = i_11 + 1;

vector.cpp:11: note: type of def: 3.
vector.cpp:11: note: mark relevant 3, live 0.
vector.cpp:11: note: vect_is_simple_use: operand 0
vector.cpp:11: note: worklist: examine stmt: i_4 = i_11 + 1;

vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: mark relevant 3, live 0.
vector.cpp:11: note: already marked relevant/live.
vector.cpp:11: note: worklist: examine stmt: j_10 = PHI <j_3(4), 0(2)>

vector.cpp:11: note: vect_is_simple_use: operand j_3
vector.cpp:11: note: def_stmt: j_3 = i_11 + j_10;

vector.cpp:11: note: type of def: 5.
vector.cpp:11: note: reduc-stmt defining reduc-phi in the same nest.
vector.cpp:11: note: vect_is_simple_use: operand 0
vector.cpp:11: note: === vect_analyze_dependences ===
vector.cpp:11: note: === vect_determine_vectorization_factor ===
vector.cpp:11: note: ==> examining phi: j_10 = PHI <j_3(4), 0(2)>

vector.cpp:11: note: get vectype for scalar type:  int
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: nunits = 4
vector.cpp:11: note: ==> examining phi: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: get vectype for scalar type:  int
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: nunits = 4
vector.cpp:11: note: ==> examining phi: ivtmp_2 = PHI <ivtmp_1(4), 1024(2)>

vector.cpp:11: note: ==> examining statement: j_3 = i_11 + j_10;

vector.cpp:11: note: get vectype for scalar type:  int
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: get vectype for scalar type:  int
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: nunits = 4
vector.cpp:11: note: ==> examining statement: i_4 = i_11 + 1;

vector.cpp:11: note: get vectype for scalar type:  int
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: get vectype for scalar type:  int
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: nunits = 4
vector.cpp:11: note: ==> examining statement: ivtmp_1 = ivtmp_2 - 1;

vector.cpp:11: note: skip.
vector.cpp:11: note: ==> examining statement: if (ivtmp_1 != 0)

vector.cpp:11: note: skip.
vector.cpp:11: note: vectorization factor = 4
vector.cpp:11: note: === vect_analyze_data_refs_alignment ===
vector.cpp:11: note: === vect_analyze_data_ref_accesses ===
vector.cpp:11: note: === vect_prune_runtime_alias_test_list ===
vector.cpp:11: note: === vect_enhance_data_refs_alignment ===
vector.cpp:11: note: vect_can_advance_ivs_p:
vector.cpp:11: note: Analyze phi: j_10 = PHI <j_3(4), 0(2)>

vector.cpp:11: note: reduc phi. skip.
vector.cpp:11: note: Analyze phi: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: Access function of PHI: {0, +, 1}_1
vector.cpp:11: note: Analyze phi: ivtmp_2 = PHI <ivtmp_1(4), 1024(2)>

vector.cpp:11: note: Access function of PHI: {1024, +, 4294967295}_1
vector.cpp:11: note: === vect_analyze_slp ===
vector.cpp:11: note: === vect_make_slp_decision ===
vector.cpp:11: note: === vect_detect_hybrid_slp ===
vector.cpp:11: note: === vect_analyze_loop_operations ===
vector.cpp:11: note: examining phi: j_10 = PHI <j_3(4), 0(2)>

vector.cpp:11: note: examining phi: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: === vectorizable_induction ===
vector.cpp:11: note: vect_model_induction_cost: inside_cost = 1, prologue_cost = 2 .
vector.cpp:11: note: examining phi: ivtmp_2 = PHI <ivtmp_1(4), 1024(2)>

vector.cpp:11: note: ==> examining statement: j_3 = i_11 + j_10;

vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: vect_is_simple_use: operand j_10
vector.cpp:11: note: def_stmt: j_10 = PHI <j_3(4), 0(2)>

vector.cpp:11: note: type of def: 5.
vector.cpp:11: note: detected reduction: j_3 = i_11 + j_10;

vector.cpp:11: note: reduc op not supported by target.
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) intvect_model_reduction_cost: inside_cost = 1, prologue_cost = 1, epilogue_cost = 5 .
vector.cpp:11: note: ==> examining statement: i_4 = i_11 + 1;

vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(2)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: vect_is_simple_use: operand 1
vector.cpp:11: note: === vectorizable_operation ===
vector.cpp:11: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 1 .
vector.cpp:11: note: ==> examining statement: ivtmp_1 = ivtmp_2 - 1;

vector.cpp:11: note: irrelevant.
vector.cpp:11: note: ==> examining statement: if (ivtmp_1 != 0)

vector.cpp:11: note: irrelevant.
vector.cpp:11: note: vectorization_factor = 4, niters = 1024
vector.cpp:11: note: === vect_update_slp_costs_according_to_vf ===
vector.cpp:11: note: Cost model analysis:
  Vector inside of loop cost: 3
  Vector prologue cost: 4
  Vector epilogue cost: 5
  Scalar iteration cost: 2
  Scalar outside cost: 0
  Vector outside cost: 9
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 8

vector.cpp:11: note:   Runtime profitability threshold = 7

vector.cpp:11: note:   Static estimate profitability threshold = 7


Vectorizing loop at vector.cpp:11

vector.cpp:11: note: === vec_transform_loop ===
vector.cpp:11: note: ------>vectorizing phi: j_10 = PHI <j_3(4), 0(6)>

vector.cpp:11: note: ------>vectorizing phi: i_11 = PHI <i_4(4), 0(6)>

vector.cpp:11: note: transform phi.
vector.cpp:11: note: transform induction phi.
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: created new init_stmt: stmp_var_.3_9 = 0 + 1;

vector.cpp:11: note: created new init_stmt: stmp_var_.3_8 = stmp_var_.3_9 + 1;

vector.cpp:11: note: created new init_stmt: stmp_var_.3_12 = stmp_var_.3_8 + 1;

vector.cpp:11: note: created new init_stmt: vect_cst_.4_13 = {0, stmp_var_.3_9, stmp_var_.3_8, stmp_var_.3_12};

vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: created new init_stmt: vect_cst_.5_14 = { 4, 4, 4, 4 };

vector.cpp:11: note: transform induction: created def-use cycle: vect_vec_iv_.6_15 = PHI <vect_vec_iv_.6_16(4), vect_cst_.4_13(6)>

vect_vec_iv_.6_16 = vect_vec_iv_.6_15 + vect_cst_.5_14;

vector.cpp:11: note: ------>vectorizing phi: ivtmp_2 = PHI <ivtmp_1(4), 1024(6)>

vector.cpp:11: note: ------>vectorizing phi: vect_vec_iv_.6_15 = PHI <vect_vec_iv_.6_16(4), vect_cst_.4_13(6)>

vector.cpp:11: note: ------>vectorizing statement: vect_vec_iv_.6_16 = vect_vec_iv_.6_15 + vect_cst_.5_14;

vector.cpp:11: note: ------>vectorizing statement: j_3 = i_11 + j_10;

vector.cpp:11: note: transform statement.
vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(6)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: vect_is_simple_use: operand j_10
vector.cpp:11: note: def_stmt: j_10 = PHI <j_3(4), 0(6)>

vector.cpp:11: note: type of def: 5.
vector.cpp:11: note: detected reduction: j_3 = i_11 + j_10;

vector.cpp:11: note: reduc op not supported by target.
vector.cpp:11: note: transform reduction.
vector.cpp:11: note: vect_get_vec_def_for_operand: i_11
vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(6)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: def =  i_11  def_stmt =  i_11 = PHI <i_4(4), 0(6)>

vector.cpp:11: note: add new stmt: vect_j.7_18 = vect_vec_iv_.6_15 + vect_j.7_17;

vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: vect_get_vec_def_for_operand: j_10
vector.cpp:11: note: vect_is_simple_use: operand j_10
vector.cpp:11: note: def_stmt: j_10 = PHI <j_3(4), 0(6)>

vector.cpp:11: note: type of def: 5.
vector.cpp:11: note: def =  j_10  def_stmt =  j_10 = PHI <j_3(4), 0(6)>

vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: transform reduction: created def-use cycle: vect_j.7_17 = PHI <vect_j.7_18(4), { 0, 0, 0, 0 }(6)>

vect_j.7_18 = vect_vec_iv_.6_15 + vect_j.7_17;

vector.cpp:11: note: Reduce using vector shifts
vector.cpp:11: note: extract scalar result
vector.cpp:11: note: ------>vectorizing statement: i_4 = i_11 + 1;

vector.cpp:11: note: transform statement.
vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(6)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: vect_is_simple_use: operand 1
vector.cpp:11: note: transform binary/unary operation.
vector.cpp:11: note: vect_get_vec_def_for_operand: i_11
vector.cpp:11: note: vect_is_simple_use: operand i_11
vector.cpp:11: note: def_stmt: i_11 = PHI <i_4(4), 0(6)>

vector.cpp:11: note: type of def: 4.
vector.cpp:11: note: def =  i_11  def_stmt =  i_11 = PHI <i_4(4), 0(6)>

vector.cpp:11: note: vect_get_vec_def_for_operand: 1
vector.cpp:11: note: vect_is_simple_use: operand 1
vector.cpp:11: note: get vectype with 4 units of type int
vector.cpp:11: note: vectype: vector(4) int
vector.cpp:11: note: Create vector_cst. nunits = 4
vector.cpp:11: note: created new init_stmt: vect_cst_.12_26 = { 1, 1, 1, 1 };

vector.cpp:11: note: add new stmt: vect_i.11_27 = vect_vec_iv_.6_15 + vect_cst_.12_26;

vector.cpp:11: note: ------>vectorizing statement: ivtmp_1 = ivtmp_2 - 1;

vector.cpp:11: note: ------>vectorizing statement: if (ivtmp_1 != 0)

loop at vector.cpp:12: if (ivtmp_29 < 256)

vector.cpp:11: note: LOOP VECTORIZED.
vector.cpp:8: note: vectorized 1 loops in function.

vector.cpp:8: note: ===vect_slp_analyze_bb===

vector.cpp:8: note: === vect_analyze_data_refs ===

vector.cpp:8: note: not vectorized: not enough data-refs in basic block.

vector.cpp:12: note: ===vect_slp_analyze_bb===

vector.cpp:12: note: === vect_analyze_data_refs ===

vector.cpp:12: note: not vectorized: not enough data-refs in basic block.

vector.cpp:8: note: ===vect_slp_analyze_bb===

vector.cpp:8: note: === vect_analyze_data_refs ===

vector.cpp:8: note: not vectorized: not enough data-refs in basic block.

vector.cpp:14: note: ===vect_slp_analyze_bb===

vector.cpp:14: note: === vect_analyze_data_refs ===

vector.cpp:14: note: not vectorized: not enough data-refs in basic block.

That is far too much output. Trim the flags and compile again:

[xiaochu.yh ~/tools/vector] $g++ vector.cpp -O3 -fopt-info-vec-optimized -fopt-info-vec-missed

Analyzing loop at vector.cpp:11

vector.cpp:11: note: step unknown.
vector.cpp:11: note: reduc phi. skip.
vector.cpp:11: note: reduc op not supported by target.

Vectorizing loop at vector.cpp:11

vector.cpp:11: note: reduc op not supported by target.
vector.cpp:11: note: LOOP VECTORIZED.
vector.cpp:8: note: vectorized 1 loops in function.

vector.cpp:8: note: not vectorized: not enough data-refs in basic block.

vector.cpp:12: note: not vectorized: not enough data-refs in basic block.

vector.cpp:8: note: not vectorized: not enough data-refs in basic block.

vector.cpp:14: note: not vectorized: not enough data-refs in basic block.

Trim further:

[xiaochu.yh ~/tools/vector] $g++ vector.cpp -O3 -fopt-info-vec-optimized

Analyzing loop at vector.cpp:11


Vectorizing loop at vector.cpp:11

vector.cpp:11: note: LOOP VECTORIZED.
vector.cpp:8: note: vectorized 1 loops in function.

Here is the code again. The loop on line 11 was vectorized, but what exactly did the compiler do to it?

  1 // Copyright 1999-2023 Alibaba Inc. All Rights Reserved.
  2 // Author:
  3 //   xiaochu.yh@alipay
  4 //
  5
  6 #include <stdio.h>
  7
  8 int main(int argc, const char *argv[])
  9 {
 10   int j = 0;
 11   for (int i = 0; i < 1024; ++i) {
 12     j += i;
 13   }
 14   printf("j = %d\n", j);
 15   return 0;
 16 }

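A typical example

In the example below, all three loops are vectorized. If the loop count is changed to 3, no SIMD code is generated. So the compiler evidently has something similar to a SQL optimizer: it estimates the cost of the code before and after the transformation and emits whichever version is cheaper.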

[xiaochu.yh ~/tools/vector] $g++ vector.cpp -O3 -fopt-info-vec-optimized

Analyzing loop at vector.cpp:13


Vectorizing loop at vector.cpp:13

vector.cpp:13: note: LOOP VECTORIZED.
Analyzing loop at vector.cpp:10


Vectorizing loop at vector.cpp:10

vector.cpp:10: note: LOOP VECTORIZED.
Analyzing loop at vector.cpp:7


Vectorizing loop at vector.cpp:7

vector.cpp:7: note: LOOP VECTORIZED.
vector.cpp:3: note: vectorized 3 loops in function.

[xiaochu.yh ~/tools/vector] $cat vector.cpp | nl
     1  #include <stdio.h>
     2  int main(int argc, const char *argv[])
     3  {
     4    int result[32];
     5    int j = 0;
     6    for (int i = 0; i < 32; ++i) {
     7      result[i] = i;
     8    }
     9    for (int i = 0; i < 32; ++i) {
    10      result[i] = i + result[i];
    11    }
    12    for (int i = 0; i < 32; ++i) {
    13      j += result[i];
    14    }
    15    printf("%d\n", j);
    16    return 0;
    17  }
   
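## Optimization level

With the GCC used in these tests, auto-vectorization only kicked in at -O3, the level that turns on -ftree-vectorize by default.

![image](https://img-blog.csdnimg/9abd99612880463697c9ca589dc3e2b0.png)

ChatGPT: To enable auto-vectorization when compiling C++ code with GCC, add the -ftree-vectorize compiler option. This option enables vectorization in GCC's tree-level optimizer, which automatically vectorizes code to improve execution efficiency. You can additionally pass -O3 to enable a higher level of optimization. The full compile command is as follows: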
 
