Why is std::pair faster than std::tuple?

Problem description

Here is the code for testing.

Tuple test:

```cpp
#include <tuple>
#include <vector>

using namespace std;

int main() {
    vector<tuple<int,int>> v;
    // append 100,000,000 tuples, growing the vector as needed
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_tuple(var, var));
    }
}
```

Pair test:

```cpp
#include <utility>
#include <vector>

using namespace std;

int main() {
    vector<pair<int,int>> v;
    // append 100,000,000 pairs, growing the vector as needed
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_pair(var, var));
    }
}
```

I did the time measurement with the Linux time command. The results are:

|       |  -O0   |  -O2   |
|:------|:------:|:------:|
| Pair  | 8.9 s  | 1.60 s |
| Tuple | 19.8 s | 1.96 s |
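
The exact invocations are not given in the question; presumably they were along these lines, repeated for each optimization level and for tuple.cpp:

```
g++ -std=c++11 -O0 pair.cpp -o pair   # assumed command line; likewise with -O2
time ./pair
```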

I am wondering why there is such a big difference between these two data structures at -O0, since they should be very similar. There is only a small difference at -O2.

Why is the difference at -O0 so big, and why is there any difference at all?
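
Both types should just be two ints back to back; a quick size check (purely illustrative, not part of the benchmark) confirms this on a typical implementation:

```cpp
#include <cstdio>
#include <tuple>
#include <utility>

int main() {
    // On common implementations both occupy exactly two ints (8 bytes).
    std::printf("sizeof(std::pair<int,int>)  = %zu\n", sizeof(std::pair<int, int>));
    std::printf("sizeof(std::tuple<int,int>) = %zu\n", sizeof(std::tuple<int, int>));
}
```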

EDIT:

The code with v.resize()

Pair:

```cpp
#include <utility>
#include <vector>

using namespace std;

int main() {
    vector<pair<int,int>> v;
    v.resize(100000000);  // vector is pre-sized; elements are assigned in place
    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_pair(var, var);
    }
}
```

Tuple:

```cpp
#include <tuple>
#include <vector>

using namespace std;

int main() {
    vector<tuple<int,int>> v;
    v.resize(100000000);  // vector is pre-sized; elements are assigned in place
    for (int var = 0; var < 100000000; ++var) {
        v[var] = make_tuple(var, var);
    }
}
```

Results:

|       |  -O0   |  -O2   |
|:------|:------:|:------:|
| Pair  | 5.01 s | 0.77 s |
| Tuple | 10.6 s | 0.87 s |

EDIT:

My system:

g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7) GLIBCXX_3.4.19

Solution

You are missing some crucial information: What compiler do you use? What do you use to measure the performance of the microbenchmark? What standard library implementation do you use?

My system:

g++ (GCC) 4.9.1 20140903 (prerelease) GLIBCXX_3.4.20

Anyhow, I ran your examples, but reserved the proper size of the vectors first to get rid of the memory allocation overhead. With that, I funnily observe the opposite of what you see.
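
The adjusted pair benchmark presumably looks something like this (the reserve() call is my reading of "reserved the proper size"; the answer does not show the exact code):

```cpp
#include <utility>
#include <vector>

using namespace std;

int main() {
    vector<pair<int,int>> v;
    v.reserve(100000000);  // pre-size the vector so no reallocation happens in the loop
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_pair(var, var));
    }
}
```

The tuple version is changed in the same way. With that, the measurements look like this: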

```
g++ -std=c++11 -O2 pair.cpp -o pair
perf stat -r 10 -d ./pair

 Performance counter stats for './pair' (10 runs):

      1647.045151 task-clock:HG (msec)       #    0.993 CPUs utilized            ( +-  1.94% )
              346 context-switches:HG        #    0.210 K/sec                    ( +- 40.13% )
                7 cpu-migrations:HG          #    0.004 K/sec                    ( +- 22.01% )
          182,978 page-faults:HG             #    0.111 M/sec                    ( +-  0.04% )
    3,394,685,602 cycles:HG                  #    2.061 GHz                      ( +-  2.24% ) [44.38%]
    2,478,474,676 stalled-cycles-frontend:HG #   73.01% frontend cycles idle     ( +-  1.24% ) [44.55%]
    1,550,747,174 stalled-cycles-backend:HG  #   45.68% backend cycles idle      ( +-  1.60% ) [44.66%]
    2,837,484,461 instructions:HG            #    0.84  insns per cycle
                                             #    0.87  stalled cycles per insn  ( +-  4.86% ) [55.78%]
      526,077,681 branches:HG                #  319.407 M/sec                    ( +-  4.52% ) [55.82%]
          829,623 branch-misses:HG           #    0.16% of all branches          ( +-  4.42% ) [55.74%]
      594,396,822 L1-dcache-loads:HG         #  360.887 M/sec                    ( +-  4.74% ) [55.59%]
       20,842,113 L1-dcache-load-misses:HG   #    3.51% of all L1-dcache hits    ( +-  0.68% ) [55.46%]
        5,474,166 LLC-loads:HG               #    3.324 M/sec                    ( +-  1.81% ) [44.23%]
  <not supported> LLC-load-misses:HG

      1.658671368 seconds time elapsed                                           ( +-  1.82% )
```

versus:

```
g++ -std=c++11 -O2 tuple.cpp -o tuple
perf stat -r 10 -d ./tuple

 Performance counter stats for './tuple' (10 runs):

       996.090514 task-clock:HG (msec)       #    0.996 CPUs utilized            ( +-  2.41% )
              102 context-switches:HG        #    0.102 K/sec                    ( +- 64.61% )
                4 cpu-migrations:HG          #    0.004 K/sec                    ( +- 32.24% )
          181,701 page-faults:HG             #    0.182 M/sec                    ( +-  0.06% )
    2,052,505,223 cycles:HG                  #    2.061 GHz                      ( +-  2.22% ) [44.45%]
    1,212,930,513 stalled-cycles-frontend:HG #   59.10% frontend cycles idle     ( +-  2.94% ) [44.56%]
      621,104,447 stalled-cycles-backend:HG  #   30.26% backend cycles idle      ( +-  3.48% ) [44.69%]
    2,700,410,991 instructions:HG            #    1.32  insns per cycle
                                             #    0.45  stalled cycles per insn  ( +-  1.66% ) [55.94%]
      486,476,408 branches:HG                #  488.386 M/sec                    ( +-  1.70% ) [55.96%]
          959,651 branch-misses:HG           #    0.20% of all branches          ( +-  4.78% ) [55.82%]
      547,000,119 L1-dcache-loads:HG         #  549.147 M/sec                    ( +-  2.19% ) [55.67%]
       21,540,926 L1-dcache-load-misses:HG   #    3.94% of all L1-dcache hits    ( +-  2.73% ) [55.43%]
        5,751,650 LLC-loads:HG               #    5.774 M/sec                    ( +-  3.60% ) [44.21%]
  <not supported> LLC-load-misses:HG

      1.000126894 seconds time elapsed                                           ( +-  2.47% )
```

As you can see, in my case the reason is the much higher number of stalled cycles, both in the frontend and in the backend.

Now where does this come from? I bet it comes down to some failed inlining, similar to what is explained here: std::vector performance regression when enabling C++11
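
An easy way to test that hypothesis is to rebuild with link-time optimization. The exact command line is not shown, but presumably it is just the previous one with -flto added:

```
g++ -std=c++11 -O2 -flto pair.cpp -o pair     # assumed invocation
g++ -std=c++11 -O2 -flto tuple.cpp -o tuple
perf stat -r 10 -d ./pair
perf stat -r 10 -d ./tuple
```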

Indeed, enabling -flto equalizes the results for me:

```
 Performance counter stats for './pair' (10 runs):

      1021.922944 task-clock:HG (msec)       #    0.997 CPUs utilized            ( +-  1.15% )
               63 context-switches:HG        #    0.062 K/sec                    ( +- 77.23% )
                5 cpu-migrations:HG          #    0.005 K/sec                    ( +- 34.21% )
          195,396 page-faults:HG             #    0.191 M/sec                    ( +-  0.00% )
    2,109,877,147 cycles:HG                  #    2.065 GHz                      ( +-  0.92% ) [44.33%]
    1,098,031,078 stalled-cycles-frontend:HG #   52.04% frontend cycles idle     ( +-  0.93% ) [44.46%]
      701,553,535 stalled-cycles-backend:HG  #   33.25% backend cycles idle      ( +-  1.09% ) [44.68%]
    3,288,420,630 instructions:HG            #    1.56  insns per cycle
                                             #    0.33  stalled cycles per insn  ( +-  0.88% ) [55.89%]
      672,941,736 branches:HG                #  658.505 M/sec                    ( +-  0.80% ) [56.00%]
          660,278 branch-misses:HG           #    0.10% of all branches          ( +-  2.05% ) [55.93%]
      474,314,267 L1-dcache-loads:HG         #  464.139 M/sec                    ( +-  1.32% ) [55.73%]
       19,481,787 L1-dcache-load-misses:HG   #    4.11% of all L1-dcache hits    ( +-  0.80% ) [55.51%]
        5,155,678 LLC-loads:HG               #    5.045 M/sec                    ( +-  1.69% ) [44.21%]
  <not supported> LLC-load-misses:HG

      1.025083895 seconds time elapsed                                           ( +-  1.03% )
```

and for tuple:

```
 Performance counter stats for './tuple' (10 runs):

      1018.980969 task-clock:HG (msec)       #    0.999 CPUs utilized            ( +-  0.47% )
                8 context-switches:HG        #    0.008 K/sec                    ( +- 29.74% )
                3 cpu-migrations:HG          #    0.003 K/sec                    ( +- 42.64% )
          195,396 page-faults:HG             #    0.192 M/sec                    ( +-  0.00% )
    2,103,574,740 cycles:HG                  #    2.064 GHz                      ( +-  0.30% ) [44.28%]
    1,088,827,212 stalled-cycles-frontend:HG #   51.76% frontend cycles idle     ( +-  0.47% ) [44.56%]
      697,438,071 stalled-cycles-backend:HG  #   33.15% backend cycles idle      ( +-  0.41% ) [44.76%]
    3,305,631,646 instructions:HG            #    1.57  insns per cycle
                                             #    0.33  stalled cycles per insn  ( +-  0.21% ) [55.94%]
      675,175,757 branches:HG                #  662.599 M/sec                    ( +-  0.16% ) [56.02%]
          656,205 branch-misses:HG           #    0.10% of all branches          ( +-  0.98% ) [55.93%]
      475,532,976 L1-dcache-loads:HG         #  466.675 M/sec                    ( +-  0.13% ) [55.69%]
       19,430,992 L1-dcache-load-misses:HG   #    4.09% of all L1-dcache hits    ( +-  0.20% ) [55.49%]
        5,161,624 LLC-loads:HG               #    5.065 M/sec                    ( +-  0.47% ) [44.14%]
  <not supported> LLC-load-misses:HG

      1.020225388 seconds time elapsed                                           ( +-  0.48% )
```

So remember, -flto is your friend and failed inlining can have extreme results on heavily templated code. Use perf stat to find out what's happening.
