Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations

I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

The Benchmark

#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;
    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with unsigned
            for (unsigned i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with uint64_t
            for (uint64_t i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // was free(charbuffer): memory from new[] must not be released with free()
}

The (absolutely crazy) results

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test
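
The program takes the buffer size in MB as its only argument (per the usage string above), so a typical invocation looks like this, where 1024 is just an example size:

./test 1024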

More madness, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

static uint64_t size=atol(argv[1])<<20;

My question

Can you explain these results? Especially:

How can there be such a difference between u32 and u64?
How can replacing a non-constant buffer size with a constant one trigger less optimal code?
How can the insertion of the static keyword make the u64 loop faster, even faster than the original code on my colleague's computer?

I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, or that small factors like a constant buffer size could mix up the results again. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. That is the only way I can get rid of compilers that seem to go mad over small changes. What do you think? Is there another way to reliably get the best-performing code?
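
For concreteness, the inline-assembly route I have in mind would look roughly like this (just a sketch, untested; it assumes GCC-style extended asm on x86-64, and the real instruction sequence would be pasted from the fast disassembly below):

#include <cstdint>

// Pin the popcnt sequence with extended asm so the compiler cannot
// re-allocate registers behind my back. "buf" holds n 64-bit words,
// where n is a multiple of 4.
uint64_t popcount_block(const uint64_t* buf, uint64_t n) {
    uint64_t count = 0;
    for (uint64_t i = 0; i < n; i += 4) {
        uint64_t c0, c1, c2, c3;
        __asm__("popcnt %1, %0" : "=r"(c0) : "r"(buf[i]));
        __asm__("popcnt %1, %0" : "=r"(c1) : "r"(buf[i+1]));
        __asm__("popcnt %1, %0" : "=r"(c2) : "r"(buf[i+2]));
        __asm__("popcnt %1, %0" : "=r"(c3) : "r"(buf[i+3]));
        count += c0 + c1 + c2 + c3;
    }
    return count;
}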

The Disassembly

Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

0x400af8: lea    0x1(%rdx),%eax
          popcnt (%rbx,%rax,8),%r9
          lea    0x2(%rdx),%edi
          popcnt (%rbx,%rcx,8),%rax
          lea    0x3(%rdx),%esi
          add    %r9,%rax
          popcnt (%rbx,%rdi,8),%rcx
          add    $0x4,%edx
          add    %rcx,%rax
          popcnt (%rbx,%rsi,8),%rcx
          add    %rcx,%rax
          mov    %edx,%ecx
          add    %rax,%r14
          cmp    %rbp,%rcx
          jb     0x400af8

Lessons learned

No matter what the answer to this question turns out to be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any connection to the hot code. I had never thought about what type to use for a loop variable, but as you can see, such a minor change can make a 100% difference! Even the storage class of a buffer can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight, hot loops that are crucial for system performance.

It is also interesting that the performance difference is still so large even though I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.

Accepted Answer

Culprit: False Data Dependency (and the compiler isn't even aware of it)

On Sandy/Ivy Bridge and Haswell processors, the instruction:

popcnt src, dest

appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, it will wait until dest is ready before executing.

This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different iterations.
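
A minimal way to see the carried chain (my own sketch, not from the original benchmark; it assumes GCC or Clang on an affected x86-64 CPU, and that the register allocator picks the same destination register on every iteration, which it typically does here):

#include <cstdint>
#include <cstdio>

int main() {
    uint64_t x = 0x123456789abcdef0ULL;
    uint64_t sum = 0;
    for (uint64_t k = 0; k < 1000000000ULL; ++k) {
        uint64_t dst;
        // dst is write-only, yet on Sandy/Ivy Bridge and Haswell this popcnt
        // stalls until the previous iteration's value of that register is
        // ready, so the loop runs at popcnt latency (~3 cycles), not at
        // popcnt throughput (1 per cycle).
        __asm__("popcnt %1, %0" : "=r"(dst) : "r"(x));
        sum += dst;
    }
    printf("popcounts summed: %llu\n", (unsigned long long)sum);
    return 0;
}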

The unsigned vs. uint64_t and the other tweaks don't directly affect the problem. But they influence the register allocator, which assigns registers to the variables.

In your case, the speeds are a direct result of what gets stuck onto the (false) dependency chain, depending on what the register allocator decided to do.

13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
20 GB/s has a chain: popcnt-popcnt → next iteration
26 GB/s has a chain: popcnt-popcnt → next iteration

The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.


To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.

Here are the results:

Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)

GCC 4.6.3 on Ubuntu 12: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native

Different Registers: 18.6195 GB/s

.L4:
    movq    (%rbx,%rax,8), %r8
    movq    8(%rbx,%rax,8), %r9
    movq    16(%rbx,%rax,8), %r10
    movq    24(%rbx,%rax,8), %r11
    addq    $4, %rax
    popcnt  %r8, %r8
    add     %r8, %rdx
    popcnt  %r9, %r9
    add     %r9, %rcx
    popcnt  %r10, %r10
    add     %r10, %rdi
    popcnt  %r11, %r11
    add     %r11, %rsi
    cmpq    $131072, %rax
    jne     .L4

Same Register: 8.49272 GB/s

.L9:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx
    # This time reuse "rax" for all the popcnts.
    popcnt  %r9, %rax
    add     %rax, %rcx
    popcnt  %r10, %rax
    add     %rax, %rsi
    popcnt  %r11, %rax
    add     %rax, %r8
    popcnt  %rbp, %rax
    add     %rax, %rdi
    cmpq    $131072, %rdx
    jne     .L9
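
That 8.5 GB/s is in the right ballpark for a simple latency model (my own back-of-the-envelope numbers, not from the original measurements: ~3-cycle popcnt latency, 3.5 GHz clock). The four popcnts above are serialized through %rax, so each iteration's chain costs roughly 4 × 3 = 12 cycles for 32 bytes of input:

#include <cstdio>

// Predicted throughput when the loop-carried chain dominates.
// All constants are assumptions, good only for an order-of-magnitude check.
int main() {
    const double ghz = 3.5;              // Sandy Bridge Xeon used in these tests
    const double bytes_per_iter = 32.0;  // 4 x 64-bit words per unrolled iteration
    const double chain_cycles = 4 * 3.0; // four chained ~3-cycle popcnts
    printf("predicted: %.1f GB/s (measured: 8.49 GB/s)\n",
           ghz * bytes_per_iter / chain_cycles);
    return 0;
}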

Same Register with broken chain: 17.8869 GB/s

.L14:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx
    # Reuse "rax" for all the popcnts.
    xor     %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
    popcnt  %r9, %rax
    add     %rax, %rcx
    popcnt  %r10, %rax
    add     %rax, %rsi
    popcnt  %r11, %rax
    add     %rax, %r8
    popcnt  %rbp, %rax
    add     %rax, %rdi
    cmpq    $131072, %rdx
    jne     .L14

So what went wrong with the compiler?

It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of them.

popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.

(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
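
A quick way to check what your own compiler does (my suggestion, not part of the original answer): compile a one-line popcount and look for a dependency-breaking xor in front of the popcnt:

// popcnt_check.cpp -- build with: g++ -O2 -mpopcnt -S popcnt_check.cpp
#include <cstdint>

uint64_t f(uint64_t x) { return __builtin_popcountll(x); }

// A compensating compiler emits something like:
//     xorl    %eax, %eax      # zeroing idiom: kills the false dep on %rax
//     popcntq %rdi, %rax
// while a non-compensating one emits the bare popcnt.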

Why does the CPU have such a false dependency?

We can only speculate, but it's likely that Intel has the same handling for a lot of two-operand instructions. Common instructions like add and sub take two operands, both of which are inputs. So Intel probably shoved popcnt into the same category to keep the processor design simple.
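
To illustrate the point in the same AT&T syntax used above (my own sketch): for add the dependency on the destination is real, while for popcnt it is purely an artifact of the destination being treated as an input:

add    %rsi, %rdi    # reads both %rsi and %rdi, writes %rdi: a true dependency on dest
popcnt %rsi, %rdi    # reads only %rsi, but the scheduler still waits for %rdi

This is also why the xor trick works: xor %rax, %rax is recognized as a zeroing idiom at register rename, so it cuts the chain without costing any execution latency.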

AMD processors do not appear to have this false dependency.


The full test code is below for reference:

#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;

    uint64_t size = 1 << 20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size/8; i += 4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "popcnt %4, %4  \n\t"
                    "add %4, %0     \n\t"
                    "popcnt %5, %5  \n\t"
                    "add %5, %1     \n\t"
                    "popcnt %6, %6  \n\t"
                    "add %6, %2     \n\t"
                    "popcnt %7, %7  \n\t"
                    "add %7, %3     \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size/8; i += 4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "Chain 4 \t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size/8; i += 4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "Broken Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // was free(charbuffer): memory from new[] must not be released with free()
    return 0;
}

An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si. It varies the number of popcnts that are in the (false) dependency chain.

False Chain 0:  41959360000  0.57748 sec   18.1578 GB/s
False Chain 1:  41959360000  0.585398 sec  17.9122 GB/s
False Chain 2:  41959360000  0.645483 sec  16.2448 GB/s
False Chain 3:  41959360000  0.929718 sec  11.2784 GB/s
False Chain 4:  41959360000  1.23572 sec   8.48557 GB/s
