Summing all elements of a quadword vector in ARM assembly using NEON

Updated: 2024-10-23 17:28:42


Question


I'm rather new to assembly, and although the ARM Information Center is often helpful, the instruction descriptions can be a little confusing to a newcomer. Basically, what I need to do is sum the 4 float values in a quadword register and store the result in a single-precision register. I think the VPADD instruction can do what I need, but I'm not quite sure.

Recommended answer

It seems that you want the sum of an array of some length, not just of four float values.

In that case, your code will work, but it is far from optimized:

many pipeline interlocks

an unnecessary 32-bit addition per iteration

Assuming the length of the array is a multiple of 8 and at least 16:

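  vldmia pSrc!, {q0-q1}       @ load the first 8 floats into two accumulators
  sub count, count, #8        @ 8 elements already consumed
loop:
  pld [pSrc, #32]             @ prefetch data for an upcoming iteration
  vldmia pSrc!, {q2-q3}       @ next 8 floats (q2-q3 are caller-saved scratch registers)
  subs count, count, #8       @ update the counter early; sets flags for bgt
  vadd.f32 q0, q0, q2         @ two independent accumulators avoid
  vadd.f32 q1, q1, q3         @ back-to-back dependencies on one register
  bgt loop

  vadd.f32 q0, q0, q1         @ merge the accumulators: 4 partial sums in q0
  vpadd.f32 d0, d0, d1        @ pairwise add: d0 = {s0+s1, s2+s3}
  vadd.f32 s0, s0, s1         @ final scalar sum in s0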

pld, although an ARM instruction rather than a NEON one, is crucial for performance. It greatly increases the cache hit rate.

I hope the rest of the code above is self-explanatory.
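For readers who prefer C, the same reduction could be sketched with the NEON intrinsics from arm_neon.h. This is only a sketch under assumptions, not the answerer's code: the function names hsum4 and sum_array are hypothetical, the prefetch is left out, and a scalar fallback that performs the additions in the same lane order is included so the file also builds on hosts without NEON. As in the assembly, the length is assumed to be a multiple of 8 and at least 16.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Horizontal sum of the four lanes of one quadword vector (hypothetical helper). */
static float hsum4(float32x4_t v)
{
    /* Pairwise add of the low and high halves: {v0+v1, v2+v3},
       the counterpart of "vpadd.f32 d0, d0, d1". */
    float32x2_t pair = vpadd_f32(vget_low_f32(v), vget_high_f32(v));
    /* Final scalar add, the counterpart of "vadd.f32 s0, s0, s1". */
    return vget_lane_f32(pair, 0) + vget_lane_f32(pair, 1);
}

/* Sum n floats. Assumes n is a multiple of 8 and at least 16. */
float sum_array(const float *src, size_t n)
{
    /* Two independent accumulators, like q0 and q1 in the assembly. */
    float32x4_t acc0 = vld1q_f32(src);
    float32x4_t acc1 = vld1q_f32(src + 4);

    for (size_t i = 8; i < n; i += 8) {
        acc0 = vaddq_f32(acc0, vld1q_f32(src + i));
        acc1 = vaddq_f32(acc1, vld1q_f32(src + i + 4));
    }
    /* Merge the accumulators, then reduce the four remaining lanes. */
    return hsum4(vaddq_f32(acc0, acc1));
}

#else

/* Scalar fallback for non-NEON hosts: same lane-wise order of additions. */
float sum_array(const float *src, size_t n)
{
    float acc[8];   /* acc[0..3] models q0, acc[4..7] models q1 */
    float q[4];

    for (int k = 0; k < 8; ++k)
        acc[k] = src[k];
    for (size_t i = 8; i < n; i += 8)
        for (int k = 0; k < 8; ++k)
            acc[k] += src[i + k];

    for (int k = 0; k < 4; ++k)
        q[k] = acc[k] + acc[k + 4];          /* vadd.f32 q0, q0, q1 */
    return (q[0] + q[1]) + (q[2] + q[3]);    /* vpadd, then scalar vadd */
}

#endif
```

Calling sum_array on an array holding 1.0f through 16.0f should return 136.0f. If only the four lanes of a single quadword register need to be summed, as in the original question, the pairwise add followed by one scalar add (the hsum4 step) is all that is required.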

You will notice that this version is many times faster than your initial one.


Published: 2023-04-16 14:37:00

Article link: https://www.elefans.com/category/jswz/34/889099.html