SSE instructions: which CPUs can do atomic 16B memory operations?

Problem description

Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.

The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.

The question: Do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16 byte boundary executes as a single memory access? If so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer - PDF document lookup, brute force testing, math proof, or whatever other method you used to determine the answer.

This question relates to problems such as research.swtch/2010/02/off-to-races.html

Update:

I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.

    // Compile with:
    //   gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
    //
    // Make sure you have at least two physical CPU cores or hyper-threading.

    #include <pthread.h>
    #include <emmintrin.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>   // abort()
    #include <string.h>

    typedef int v4si __attribute__ ((vector_size (16)));
    volatile v4si x;   // the shared 16-byte location both threads hammer on

    unsigned n1[16] __attribute__((aligned(64)));
    unsigned n2[16] __attribute__((aligned(64)));

    // Reads x, records the sign-bit mask it saw, then writes all zeros.
    void* thread1(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
            int mask = _mm_movemask_ps((__m128)x);
            n1[mask]++;

            x = (v4si){0,0,0,0};
        }
        return NULL;
    }

    // Reads x, records the sign-bit mask it saw, then writes all ones.
    void* thread2(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
            int mask = _mm_movemask_ps((__m128)x);
            n2[mask]++;

            x = (v4si){-1,-1,-1,-1};
        }
        return NULL;
    }

    int main() {
        // Check memory alignment
        if ( (((uintptr_t)&x) & 0x0f) != 0 )
            abort();

        memset(n1, 0, sizeof(n1));
        memset(n2, 0, sizeof(n2));

        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        // Any mask other than 0000 or 1111 means a read saw a mix of two writes.
        for (unsigned i=0; i<16; i++) {
            for (int j=3; j>=0; j--)
                printf("%d", (i>>j)&1);

            printf(" %10u %10u", n1[i], n2[i]);
            if(i>0 && i<0x0f) {
                if(n1[i] || n2[i])
                    printf(" Not a single memory access!");
            }

            printf("\n");
        }

        return 0;
    }

The CPU I have in my notebook is Core Duo (not Core2). This particular CPU fails the test: it implements 16-byte memory reads/writes with a granularity of 8 bytes. The output is:

    0000   96905702      10512
    0001          0          0
    0010          0          0
    0011         22      12924 Not a single memory access!
    0100          0          0
    0101          0          0
    0110          0          0
    0111          0          0
    1000          0          0
    1001          0          0
    1010          0          0
    1011          0          0
    1100    3092557       1175 Not a single memory access!
    1101          0          0
    1110          0          0
    1111       1719   99975389
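To make the reading of these masks explicit, here is a minimal sketch of how a 4-bit value from `_mm_movemask_ps` can be classified. The helper names `mask_is_torn` and `tear_granularity` are made up for illustration and are not part of the test program. The sketch assumes, as in the test, that the only values ever written are all-zeros and all-ones.

```c
#include <stdbool.h>

/* mask is the 4-bit result of _mm_movemask_ps: bit k is the sign bit of
 * 32-bit lane k. With only all-zeros and all-ones ever written, an untorn
 * read can only produce 0000 or 1111. */
static bool mask_is_torn(unsigned mask) {
    return mask != 0x0u && mask != 0xFu;
}

/* Returns the largest access granularity in bytes (16, 8 or 4) that is
 * still consistent with having observed this mask. */
static int tear_granularity(unsigned mask) {
    if (!mask_is_torn(mask))
        return 16;                          /* consistent with one 16-byte access */

    unsigned lo = mask & 0x3u;              /* lanes 0-1: low 8 bytes  */
    unsigned hi = (mask >> 2) & 0x3u;       /* lanes 2-3: high 8 bytes */
    bool lo_whole = (lo == 0x0u || lo == 0x3u);
    bool hi_whole = (hi == 0x0u || hi == 0x3u);

    /* Each 8-byte half is internally consistent, but the halves disagree. */
    return (lo_whole && hi_whole) ? 8 : 4;
}
```

Applied to the Core Duo output, the only torn masks seen are 0011 and 1100, and both classify as 8, which is what the 8-byte granularity statement is based on. A mask such as 0001 would instead point to tearing inside an 8-byte half.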

Solution

In the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, it is said in section 8.2.3.1, as you note yourself, that

The Intel-64 memory ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:

• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.

Any locked instruction (either the XCHG instruction or another read-modify-write instruction with a LOCK prefix) appears to execute as an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of alignment.

Now, since the above list does NOT contain the same language for double quadword (16 bytes), it follows that the architecture does NOT guarantee that instructions which access 16 bytes of memory are atomic.

That being said, the last paragraph does hint at a way out, namely the CMPXCHG16B instruction with the LOCK prefix. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
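As a rough illustration of that way out, the sketch below checks the CX16 bit (CPUID leaf 1, ECX bit 13) and builds a 16-byte atomic load and store on top of LOCK CMPXCHG16B. It assumes GCC or Clang on x86-64 (for `<cpuid.h>` and the inline-asm syntax). The type `u128_t` and the functions `cpu_has_cx16`, `cas16`, `load16` and `store16` are hypothetical names introduced here, not an existing API.

```c
#include <stdbool.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Hypothetical 16-byte value type; CMPXCHG16B requires 16-byte alignment. */
typedef struct { uint64_t lo, hi; } __attribute__((aligned(16))) u128_t;

/* CPUID leaf 1, ECX bit 13 is the CMPXCHG16B ("CX16") feature flag. */
static bool cpu_has_cx16(void) {
#if defined(__x86_64__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ecx >> 13) & 1u) != 0;
#else
    return false;    /* not x86-64: the instruction does not exist */
#endif
}

/* Atomically: if (*p == *expected) { *p = desired; return true; }
 *             else { *expected = *p; return false; }               */
static bool cas16(volatile u128_t *p, u128_t *expected, u128_t desired) {
#if defined(__x86_64__)
    unsigned char ok;
    __asm__ __volatile__(
        "lock cmpxchg16b %1\n\t"   /* compares RDX:RAX with *p, stores RCX:RBX on match */
        "sete %0"                  /* ZF is set when the exchange happened */
        : "=q"(ok), "+m"(*p), "+a"(expected->lo), "+d"(expected->hi)
        : "b"(desired.lo), "c"(desired.hi)
        : "memory", "cc");
    return ok != 0;
#else
    (void)p; (void)expected; (void)desired;
    return false;    /* stub so the file compiles elsewhere; never succeeds */
#endif
}

/* 16-byte atomic load: compare against 0 and "swap in" 0. Either *p was 0
 * and stays 0, or the compare fails and the current value is returned. */
static u128_t load16(volatile u128_t *p) {
    u128_t expected = {0, 0};
    cas16(p, &expected, expected);
    return expected;
}

/* 16-byte atomic store: retry until the exchange succeeds. Only call this
 * when cpu_has_cx16() is true, otherwise the loop cannot terminate. */
static void store16(volatile u128_t *p, u128_t value) {
    u128_t expected = {0, 0};
    while (!cas16(p, &expected, value)) {
        /* expected now holds the value currently in *p; try again */
    }
}
```

A caller would test `cpu_has_cx16()` once and then use `load16`/`store16` on a 16-byte-aligned `u128_t`. Note the trade-off: a load built this way is a locked read-modify-write, so it needs writable memory and costs considerably more than a plain MOVAPS/MOVDQA.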

In the corresponding AMD document, AMD64 Technology AMD64 Architecture Programmer’s Manual Volume 2: System Programming, I can't find similar clear language.

EDIT: Test program results

(Test program modified to increase #iterations by a factor of 10)

On a Xeon X3450 (x86-64):

    0000  999998139       1572
    0001          0          0
    0010          0          0
    0011          0          0
    0100          0          0
    0101          0          0
    0110          0          0
    0111          0          0
    1000          0          0
    1001          0          0
    1010          0          0
    1011          0          0
    1100          0          0
    1101          0          0
    1110          0          0
    1111       1861  999998428

On a Xeon 5150 (32-bit):

    0000  999243100     283087
    0001          0          0
    0010          0          0
    0011          0          0
    0100          0          0
    0101          0          0
    0110          0          0
    0111          0          0
    1000          0          0
    1001          0          0
    1010          0          0
    1011          0          0
    1100          0          0
    1101          0          0
    1110          0          0
    1111     756900  999716913

On an Opteron 2435 (x86-64):

    0000  999995893       1901
    0001          0          0
    0010          0          0
    0011          0          0
    0100          0          0
    0101          0          0
    0110          0          0
    0111          0          0
    1000          0          0
    1001          0          0
    1010          0          0
    1011          0          0
    1100          0          0
    1101          0          0
    1110          0          0
    1111       4107  999998099

Does this mean that Intel and/or AMD guarantee that 16 byte memory accesses are atomic on these machines? IMHO, it does not. It's not in the documentation as guaranteed architectural behavior, and thus one cannot know whether 16 byte memory accesses really are atomic on these particular processors or whether the test program merely fails to trigger a non-atomic access for one reason or another. Relying on it is therefore dangerous.

EDIT 2: How to make the test program fail

Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:

    0000  999998634       5990
    0001          0          0
    0010          0          0
    0011          0          0
    0100          0          0
    0101          0          0
    0110          0          0
    0111          0          0
    1000          0          0
    1001          0          0
    1010          0          0
    1011          0          0
    1100          0          1 Not a single memory access!
    1101          0          0
    1110          0          0
    1111       1366  999994009

So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.

EDIT 3: ASM for the thread functions, on request of "GJ."

Here's the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:

    .globl thread2
            .type   thread2, @function
    thread2:
    .LFB537:
            .cfi_startproc
            movdqa  .LC3(%rip), %xmm1
            xorl    %eax, %eax
            .p2align 5,,24
            .p2align 3
    .L11:
            movaps  x(%rip), %xmm0
            incl    %eax
            movaps  %xmm1, x(%rip)
            movmskps        %xmm0, %edx
            movslq  %edx, %rdx
            incl    n2(,%rdx,4)
            cmpl    $1000000000, %eax
            jne     .L11
            xorl    %eax, %eax
            ret
            .cfi_endproc
    .LFE537:
            .size   thread2, .-thread2
            .p2align 5,,31
    .globl thread1
            .type   thread1, @function
    thread1:
    .LFB536:
            .cfi_startproc
            pxor    %xmm1, %xmm1
            xorl    %eax, %eax
            .p2align 5,,24
            .p2align 3
    .L15:
            movaps  x(%rip), %xmm0
            incl    %eax
            movaps  %xmm1, x(%rip)
            movmskps        %xmm0, %edx
            movslq  %edx, %rdx
            incl    n1(,%rdx,4)
            cmpl    $1000000000, %eax
            jne     .L15
            xorl    %eax, %eax
            ret
            .cfi_endproc

and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:

    .LC3:
            .long   -1
            .long   -1
            .long   -1
            .long   -1
            .ident  "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
            .section        .note.GNU-stack,"",@progbits

Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with -march=native, which makes GCC prefer MOVAPS. That choice doesn't matter, though: if I use -march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.
