SSE assembly has the SQRTPS instruction.
SQRTPS has two forms:

    SQRTPS xmm1, xmm2
    SQRTPS xmm1, m128

GCC, Clang, and Visual Studio (all of them) provide the intrinsic _mm_sqrt_ps.
But _mm_sqrt_ps only works on a value that has already been loaded into an XMM register (with _mm_set_ps / _mm_load_ps).
See the Visual Studio documentation, for example: http://msdn.microsoft.com/en-us/library/vstudio/8z67bwwk%28v=vs.100%29.aspx
What I expect:

    __attribute__((aligned(16))) float data[4];
    __attribute__((aligned(16))) float result[4];

    asm {
        sqrtps xmm0, data    // DIRECTLY FROM MEMORY
        movaps result, xmm0
    }

What I have (in C):

    __attribute__((aligned(16))) float data[4];
    __attribute__((aligned(16))) float result[4];

    __m128 xmm = _mm_load_ps(data);   // or _mm_set_ps
    xmm = _mm_sqrt_ps(xmm);
    _mm_store_ps(&result[0], xmm);

(in asm):

    movaps xmm1, data
    sqrtps xmm0, xmm1    // FROM REGISTER
    movaps result, xmm0

In other words, I would like to see something like this:

    __attribute__((aligned(16))) float data[4];
    __attribute__((aligned(16))) float result[4];

    __m128 xmm = _mm_sqrt_ps(data);   // DIRECTLY FROM MEMORY, no need to load (because there is such an instruction)
    _mm_store_ps(&result[0], xmm);

Best answer
The answer to your question is that you can't control this with intrinsics, at least for aligned loads. It's up to the compiler to decide whether it uses SQRTPS xmm1, xmm2 or SQRTPS xmm1, m128. If you want to be 100% certain then you have to write it in assembly. This is one of the deficiencies of intrinsics (at least as they are currently implemented), in my opinion.
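As a rough sketch of what "write it in assembly" could look like without leaving C, GCC and Clang accept extended inline asm. The helper name sqrt_ps_from_memory is hypothetical (it is not part of any library), and the sketch assumes an x86-64 target, the default AT&T operand order, and a 16-byte-aligned source pointer, since the non-VEX memory form of SQRTPS faults on unaligned addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical helper (not a library function): forces the
 * "SQRTPS xmm, m128" form by handing the asm a memory operand.
 * Assumes GCC/Clang extended asm with AT&T operand order. */
static inline __m128 sqrt_ps_from_memory(const float *src)
{
    __m128 out;

    /* The memory form of SQRTPS requires a 16-byte-aligned address. */
    assert(((uintptr_t)src % 16) == 0);

    __asm__("sqrtps %1, %0"               /* %1 = memory source, %0 = xmm dest */
            : "=x"(out)                   /* "x": any SSE register */
            : "m"(*(const __m128 *)src)); /* "m": keep the operand in memory */
    return out;
}
```

It would be called as `_mm_store_ps(result, sqrt_ps_from_memory(data));` with both arrays aligned to 16 bytes. The trade-off is that the compiler can no longer schedule or fold this instruction as it would an intrinsic, and the code stops being portable to MSVC.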
Some code can help explain this.
We can get GCC (64-bit with -O3) to generate both forms by using aligned and unaligned loads:

    float x[4], y[4];
    __m128 x4 = _mm_loadu_ps(x);
    __m128 y4 = _mm_sqrt_ps(x4);
    _mm_storeu_ps(y, y4);

This gives (with Intel syntax):

    movups xmm0, XMMWORD PTR [rdx]
    sqrtps xmm0, xmm0

However, if we do an aligned load we get the other form:

    float x[4], y[4];
    __m128 x4 = _mm_load_ps(x);
    __m128 y4 = _mm_sqrt_ps(x4);
    _mm_storeu_ps(y, y4);

This combines the load and the square root into one instruction:

    sqrtps xmm0, XMMWORD PTR [rax]

Most people would say "trust the compiler." I disagree. If you're using intrinsics then it should be assumed that YOU know what you're doing and NOT the compiler. Here is an example, difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp, where GCC chose one form and MSVC chose the other (for multiplication rather than sqrt), and it made a difference in performance.
So once again, if you're using aligned loads, you can only pray that the compiler does what you want. And then maybe on the next version of the compiler it does something different...
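Since the choice can change between compilers and versions, one practical option is to isolate the intrinsic sequence in a small function and inspect the generated assembly after each toolchain change. The sketch below uses a hypothetical function name, sqrt4_aligned, and assumes both pointers are 16-byte aligned. Compiling it with something like `gcc -O3 -S -masm=intel` and searching the output for sqrtps shows which form was chosen; which one you get is not guaranteed.

```c
#include <xmmintrin.h>

/* Hypothetical helper: square roots of four packed floats.
 * Both pointers are assumed to be 16-byte aligned. Whether the
 * load is folded into SQRTPS is left to the compiler. */
void sqrt4_aligned(const float *in, float *out)
{
    __m128 v = _mm_load_ps(in);        /* aligned load (may be folded away) */
    _mm_store_ps(out, _mm_sqrt_ps(v)); /* aligned store of the four roots */
}
```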