SSE C++ memory commands

Updated: 2024-06-14 16:59:47

SSE assembly has the SQRTPS instruction.

SQRTPS comes in two forms:

    SQRTPS xmm1, xmm2
    SQRTPS xmm1, m128

GCC, Clang, and Visual Studio all provide the _mm_sqrt_ps intrinsic.

But _mm_sqrt_ps only accepts a value that has already been loaded into an __m128 (with _mm_set_ps or _mm_load_ps).

See the Visual Studio documentation, for example: http://msdn.microsoft.com/en-us/library/vstudio/8z67bwwk%28v=vs.100%29.aspx

What I expect:

    __attribute__((aligned(16))) float data[4];
    __attribute__((aligned(16))) float result[4];
    asm {
        sqrtps xmm0, data    // DIRECTLY FROM MEMORY
        movaps result, xmm0
    }

What I have (in C++):

    __attribute__((aligned(16))) float data[4];
    __attribute__((aligned(16))) float result[4];
    auto xmm = _mm_load_ps(data);    // or _mm_set_ps
    xmm = _mm_sqrt_ps(xmm);
    _mm_store_ps(&result[0], xmm);

(in asm):

    movaps xmm1, data
    sqrtps xmm0, xmm1    // FROM REGISTER
    movaps result, xmm0

In other words, I would like to see something like this:

    __attribute__((aligned(16))) float data[4];
    __attribute__((aligned(16))) float result[4];
    // DIRECTLY FROM MEMORY, no need to load (because there is such an instruction)
    auto xmm = _mm_sqrt_ps(data);
    _mm_store_ps(&result[0], xmm);

Best answer

The answer to your question is that you can't control this with intrinsics, at least for aligned loads. It's up to the compiler to decide whether it uses SQRTPS xmm1, xmm2 or SQRTPS xmm1, m128. If you want to be 100% certain, you have to write it in assembly. In my opinion, this is one of the deficiencies of intrinsics (at least as they are currently implemented).
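
As a rough illustration of the assembly route, here is a sketch using GCC/Clang extended inline assembly. The helper name sqrt_ps_from_memory is made up for this example and is not a standard intrinsic. It assumes an x86-64 target, the default AT&T inline-assembly dialect (so no -masm=intel), and a 16-byte-aligned pointer. MSVC does not support inline assembly for x64 targets, so there the same idea would have to live in a separate assembly file.

```cpp
// Sketch only: assumes GCC or Clang targeting x86-64 (or x86 with SSE enabled)
// and the default AT&T inline-assembly dialect (no -masm=intel).
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // __m128, _mm_store_ps

// Hypothetical helper, not a standard intrinsic: always emits the
// "SQRTPS xmm, m128" form. `src` must point to 16-byte-aligned floats.
static inline __m128 sqrt_ps_from_memory(const float* src) {
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    __m128 out;
    // AT&T operand order is "source, destination".
    __asm__("sqrtps %1, %0"
            : "=x"(out)                                     // any XMM register
            : "m"(*reinterpret_cast<const __m128*>(src)));  // 16-byte memory operand
    return out;
}
```

Used as _mm_store_ps(result, sqrt_ps_from_memory(data)) with both arrays 16-byte aligned, this gives the call shape the question asks for. The trade-off is that the compiler treats the asm statement as opaque, so it cannot schedule around it or fold it into other operations.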

Some code can help explain this.

We can get GCC (64-bit, with -O3) to generate both forms by switching between unaligned and aligned loads. With an unaligned load:

    float x[4], y[4];
    __m128 x4 = _mm_loadu_ps(x);
    __m128 y4 = _mm_sqrt_ps(x4);
    _mm_storeu_ps(y, y4);

This gives (in Intel syntax):

    movups xmm0, XMMWORD PTR [rdx]
    sqrtps xmm0, xmm0

However, if we do an aligned load, we get the other form:

    float x[4], y[4];               // x must be 16-byte aligned for _mm_load_ps
    __m128 x4 = _mm_load_ps(x);
    __m128 y4 = _mm_sqrt_ps(x4);
    _mm_storeu_ps(y, y4);

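This folds the load and the square root into one instruction. In the legacy SSE encoding a memory operand must be 16-byte aligned, which is why only the aligned load can be folded this way:

    sqrtps xmm0, XMMWORD PTR [rax]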
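
To reproduce the comparison, the two snippets can be wrapped in functions and compiled to assembly. This is a minimal sketch: the function names sqrt4_aligned and sqrt4_unaligned are hypothetical, and the exact output will vary with compiler, version, and flags.

```cpp
// Sketch only: assumes an x86-64 target, where SSE is enabled by default.
#include <xmmintrin.h>  // _mm_load_ps, _mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps

// Hypothetical name. `x` must be 16-byte aligned; the compiler is free
// (but not obliged) to fold this load into "sqrtps xmm, m128".
void sqrt4_aligned(const float* x, float* y) {
    __m128 x4 = _mm_load_ps(x);
    _mm_storeu_ps(y, _mm_sqrt_ps(x4));
}

// Hypothetical name. No alignment requirement on `x`; without AVX the
// load cannot be folded, so expect a separate movups before sqrtps.
void sqrt4_unaligned(const float* x, float* y) {
    __m128 x4 = _mm_loadu_ps(x);
    _mm_storeu_ps(y, _mm_sqrt_ps(x4));
}
```

With GCC, something like g++ -O3 -S -masm=intel on this file writes Intel-syntax assembly to a .s file, where the two function bodies can be compared side by side.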

Most people would say "trust the compiler." I disagree. If you're using intrinsics, it should be assumed that YOU know what you're doing, NOT the compiler. Here is an example (difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp) where GCC chose one form and MSVC chose the other (for multiplication rather than sqrt), and it made a difference in performance.

So once again, if you're using aligned loads, you can only pray that the compiler does what you want. And then maybe on the next version of the compiler it does something different...

Published: 2023-04-17 09:02:00
Article link: https://www.elefans.com/category/dzcp/b4637b08c270813bb1c9eb18f518ad25.html