Problem description
I'm rather new to assembly, and although the ARM Information Center is often helpful, its instruction descriptions can be a little confusing to a newbie. Basically, I need to sum the 4 float values in a quadword register and store the result in a single-precision register. I think the VPADD instruction can do what I need, but I'm not quite sure.
Recommended answer
It seems that you want the sum of an array of some length, not just of four float values.
In that case, your code will work, but it is far from optimized:
many pipeline interlocks
an unnecessary 32-bit addition per iteration
Assuming the length of the array is a multiple of 8 and at least 16:
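    vldmia      pSrc!, {q0-q1}      @ load the first 8 floats into two accumulators
    sub         count, count, #8
loop:
    pld         [pSrc, #32]         @ prefetch upcoming data
    vldmia      pSrc!, {q2-q3}      @ next 8 floats (q2-q3 need not be preserved)
    subs        count, count, #8
    vadd.f32    q0, q0, q2
    vadd.f32    q1, q1, q3
    bgt         loop
    vadd.f32    q0, q0, q1          @ combine the two accumulators
    vpadd.f32   d0, d0, d1          @ d0 = {s0 + s1, s2 + s3}
    vadd.f32    s0, s0, s1          @ s0 = sum of all elements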
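pld, although an ARM instruction rather than a NEON one, is crucial for performance: it preloads data into the cache and greatly improves the cache hit rate.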
I hope the rest of the code above is self-explanatory.
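For the original four-value case, the last two instructions of the listing (vpadd.f32 followed by the scalar vadd.f32) are the part that matters. The C sketch below models what they do to the four lanes of q0. The function name hsum4_model is hypothetical and used only for illustration. It is a plain scalar model, not NEON code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: models the final reduction on a 4-lane vector
   q0 = {v[0], v[1], v[2], v[3]}, where d0 holds lanes 0-1 and d1 holds
   lanes 2-3. */
float hsum4_model(const float v[4])
{
    assert(v != NULL);

    /* vpadd.f32 d0, d0, d1  ->  d0 = {v[0] + v[1], v[2] + v[3]} */
    float d0_lo = v[0] + v[1];
    float d0_hi = v[2] + v[3];

    /* vadd.f32 s0, s0, s1   ->  s0 = d0_lo + d0_hi */
    return d0_lo + d0_hi;
}
```

So VPADD alone halves the vector to two partial sums, and one more addition is needed to get the single result in s0.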
You should find that this version is many times faster than your initial one.
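As a way to check results against the assembly, the same scheme can be sketched in C: two 4-lane accumulators, 8 floats per iteration, then a final reduction. The function name sum_f32 is hypothetical. The NEON path uses intrinsics from arm_neon.h and is compiled only when __ARM_NEON is defined; otherwise a scalar model of the same lane layout is used. The sketch assumes, like the listing, that count is a multiple of 8 and at least 16, and it omits the prefetch.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sums count floats starting at src.
   Assumes count is a multiple of 8 and at least 16. */
float sum_f32(const float *src, size_t count)
{
    assert(src != NULL && count >= 16 && count % 8 == 0);

#if defined(__ARM_NEON)
    float32x4_t acc0 = vld1q_f32(src);      /* plays the role of q0 */
    float32x4_t acc1 = vld1q_f32(src + 4);  /* plays the role of q1 */
    for (size_t i = 8; i < count; i += 8) {
        acc0 = vaddq_f32(acc0, vld1q_f32(src + i));
        acc1 = vaddq_f32(acc1, vld1q_f32(src + i + 4));
    }
    /* Combine the accumulators, then reduce 4 lanes to 1. */
    float32x4_t acc = vaddq_f32(acc0, acc1);
    float32x2_t pair = vpadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    return vget_lane_f32(pair, 0) + vget_lane_f32(pair, 1);
#else
    /* Scalar model: acc[0..3] stands in for q0, acc[4..7] for q1. */
    float acc[8];
    for (size_t lane = 0; lane < 8; ++lane)
        acc[lane] = src[lane];
    for (size_t i = 8; i < count; i += 8)
        for (size_t lane = 0; lane < 8; ++lane)
            acc[lane] += src[i + lane];
    float q[4];
    for (size_t lane = 0; lane < 4; ++lane)
        q[lane] = acc[lane] + acc[lane + 4];
    return (q[0] + q[1]) + (q[2] + q[3]);
#endif
}
```

Because the elements are added in a different order than in a simple sequential loop, the result may differ from a naive sum in the last few bits for general floating-point input.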