How could NEON be as slow as C?

I have been trying to build a fast histogram function that buckets incoming values by assigning each one the range threshold it is closest to. It would be applied to images, so it has to be fast (assume a 640x480 image array, so roughly 300,000 elements). The range thresholds are multiples of 25 (0, 25, 50, 75, 100). Inputs are floats and the final outputs are integers.
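To pin down the mapping described above, here is a minimal scalar sketch. The name `nearest_bucket` is made up for illustration, and the add-0.5-then-truncate rounding assumes non-negative inputs, which fits the 0 to 100 range in the question.

```c
/* Hypothetical helper: nearest multiple of `step` (step = 25 gives 0, 25, 50, 75, 100).
   Assumes value >= 0, so adding 0.5 and truncating rounds to the nearest integer. */
static int nearest_bucket(float value, int step) {
    return (int)(value / (float)step + 0.5f) * step;
}
```

With `step` set to 25, an input of 60.5 lands in bucket 50 and 88.5 lands in bucket 100, matching the comments in the code below.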

I tested the following versions in Xcode by opening a new empty project (no app delegate) and using only the main.m file. I removed all linked libraries except Accelerate.

Here is the C implementation. The older version was full of if/then branches, but this is the final optimized logic. It took 11 s and 300 ms.

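#include <math.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int sizeOfArray=300000;

  float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
  int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray);

  for (int i=0; i<sizeOfArray; ++i)
  {
    inputArray[i]=37.0f; //loop body missing from the post; value assumed to match the versions below
  }

  //Assume range is [0,25,50,75,100]
  int lcd=25;

  for (int j=0; j<1000; ++j)// just to get some good time interval
  {
    for (int i=0; i<sizeOfArray; ++i)
    {
      //a 60.5 would give a 50. An 88.5 would give 100
      //(statement missing from the post; reconstructed from the comment above)
      outputArray[i]=(int)(roundf(inputArray[i]/lcd)*lcd);
    }
  }
}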

Here is the vDSP implementation. Even with some tedious float-to-integer conversions back and forth, it took only 6 s, an almost 50% improvement!

//vDSP implementation
#include <Accelerate/Accelerate.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int sizeOfArray=300000;

  float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
  float* outputArrayF=(float*) malloc(sizeof(float)*sizeOfArray); //vDSP requires matching input and output types
  int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray); //value rounded to the nearest integer
  float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);
  int* finalOutputArray=(int*) malloc(sizeof(int)*sizeOfArray); //integer output, for an apples-to-apples comparison

  for (int i=0; i<sizeOfArray; ++i)
  {
    inputArray[i]=37.0f; //this will produce a final number of 25. On the other hand 37.5 would produce 50.
  }

  for (int j=0; j<1000; ++j)// just to get some good time interval
  {
    //Assume range is [0,25,50,75,100]
    float lcd=25.0f;

    //divide by lcd
    vDSP_vsdiv(inputArray, 1, &lcd, outputArrayF, 1, sizeOfArray);

    //Round to nearest integer
    vDSP_vfixr32(outputArrayF, 1, outputArray, 1, sizeOfArray);

    //MUST convert int to float (cannot just cast), then multiply by the scalar.
    //This step has the effect of rounding the number to the nearest multiple of lcd.
    vDSP_vflt32(outputArray, 1, outputArrayF, 1, sizeOfArray);
    vDSP_vsmul(outputArrayF, 1, &lcd, finalOutputArrayF, 1, sizeOfArray);
    vDSP_vfix32(finalOutputArrayF, 1, finalOutputArray, 1, sizeOfArray);
  }
}

Here is the NEON implementation. This is my first, so play nice! It was slower than vDSP and took 9 s and 300 ms, which did not make sense to me. Either vDSP is better optimized than NEON or I am doing something wrong.

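//NEON implementation
#include <arm_neon.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  int sizeOfArray=300000; //a multiple of 16, so the unrolled loop below needs no tail handling

  float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
  float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);

  for (int i=0; i<sizeOfArray; ++i)
  {
    inputArray[i]=37.0f; //this will produce a final number of 25. On the other hand 37.5 would produce 50.
  }

  for (int j=0; j<1000; ++j)// just to get some good time interval
  {
    float32x4_t c0,c1,c2,c3; //inputs
    uint32x4_t e0,e1,e2,e3;  //comparison masks (the compare intrinsics return uint32x4_t)
    float32x4_t f0,f1,f2,f3; //outputs

    //ranges of histogram buckets
    float32x4_t buckets0=vdupq_n_f32(0);
    float32x4_t buckets1=vdupq_n_f32(25);
    float32x4_t buckets2=vdupq_n_f32(50);
    float32x4_t buckets3=vdupq_n_f32(75);
    float32x4_t buckets4=vdupq_n_f32(100);

    //midpoints of ranges
    float32x4_t thresholds1=vdupq_n_f32(12.5);
    float32x4_t thresholds2=vdupq_n_f32(37.5);
    float32x4_t thresholds3=vdupq_n_f32(62.5);
    float32x4_t thresholds4=vdupq_n_f32(87.5);

    for (int i=0; i<sizeOfArray; i+=16)
    {
      c0=vld1q_f32(&inputArray[i]);    //load
      c1=vld1q_f32(&inputArray[i+4]);  //load
      c2=vld1q_f32(&inputArray[i+8]);  //load
      c3=vld1q_f32(&inputArray[i+12]); //load

      //The compare lines and the initial values of f0..f3 are missing from the post.
      //They are reconstructed here (>= is assumed, so that 37.5 maps to 50).
      f0=buckets0; f1=buckets0; f2=buckets0; f3=buckets0;

      //register 0
      e0=vcgeq_f32(c0, thresholds1);
      f0=vbslq_f32(e0, buckets1, f0);
      e0=vcgeq_f32(c0, thresholds2);
      f0=vbslq_f32(e0, buckets2, f0);
      e0=vcgeq_f32(c0, thresholds3);
      f0=vbslq_f32(e0, buckets3, f0);
      e0=vcgeq_f32(c0, thresholds4);
      f0=vbslq_f32(e0, buckets4, f0);

      //register 1
      e1=vcgeq_f32(c1, thresholds1);
      f1=vbslq_f32(e1, buckets1, f1);
      e1=vcgeq_f32(c1, thresholds2);
      f1=vbslq_f32(e1, buckets2, f1);
      e1=vcgeq_f32(c1, thresholds3);
      f1=vbslq_f32(e1, buckets3, f1);
      e1=vcgeq_f32(c1, thresholds4);
      f1=vbslq_f32(e1, buckets4, f1);

      //register 2
      e2=vcgeq_f32(c2, thresholds1);
      f2=vbslq_f32(e2, buckets1, f2);
      e2=vcgeq_f32(c2, thresholds2);
      f2=vbslq_f32(e2, buckets2, f2);
      e2=vcgeq_f32(c2, thresholds3);
      f2=vbslq_f32(e2, buckets3, f2);
      e2=vcgeq_f32(c2, thresholds4);
      f2=vbslq_f32(e2, buckets4, f2);

      //register 3
      e3=vcgeq_f32(c3, thresholds1);
      f3=vbslq_f32(e3, buckets1, f3);
      e3=vcgeq_f32(c3, thresholds2);
      f3=vbslq_f32(e3, buckets2, f3);
      e3=vcgeq_f32(c3, thresholds3);
      f3=vbslq_f32(e3, buckets3, f3);
      e3=vcgeq_f32(c3, thresholds4);
      f3=vbslq_f32(e3, buckets4, f3);

      vst1q_f32(&finalOutputArrayF[i], f0);
      vst1q_f32(&finalOutputArrayF[i+4], f1);
      vst1q_f32(&finalOutputArrayF[i+8], f2);
      vst1q_f32(&finalOutputArrayF[i+12], f3);
    }
  }
}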

PS: This is my first benchmark on this scale, so I tried to keep it simple (large loops, constant setup code, NSLog to print start/end times, only the Accelerate framework linked). If any of these assumptions significantly affects the outcome, please critique.
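On the benchmarking method in the PS: one possible alternative to printing start/end times with NSLog is to read a monotonic clock directly around the loop. The sketch below assumes a POSIX `clock_gettime` is available. `time_kernel`, `kernel_fn`, and `bucket_scalar` are hypothetical names made up for illustration, and the kernel assumes non-negative inputs.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by the kernels being compared. */
typedef void (*kernel_fn)(const float *in, int *out, size_t n);

/* Scalar reference kernel: nearest multiple of 25 (assumes non-negative inputs). */
static void bucket_scalar(const float *in, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = (int)(in[i] / 25.0f + 0.5f) * 25;
    }
}

/* Current time in seconds on a monotonic clock. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs `kernel` `reps` times and returns the elapsed seconds.
   The output is summed into *checksum after timing, so the work is observable
   and an optimizing compiler has less room to discard it. */
static double time_kernel(kernel_fn kernel, const float *in, int *out,
                          size_t n, int reps, long *checksum) {
    double start = now_seconds();
    for (int r = 0; r < reps; ++r) {
        kernel(in, out, n);
    }
    double elapsed = now_seconds() - start;

    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += out[i];
    }
    *checksum = sum;
    return elapsed;
}
```

Passing each implementation through the same `time_kernel` call keeps the setup identical across runs. Timings from an unoptimized debug build are not very meaningful, so a release build is the fairer comparison.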



First, this is not "NEON" per se. These are intrinsics. It is almost impossible to get good NEON performance using intrinsics under clang or gcc. If you think you need intrinsics, you should hand-write the assembly instead.

vDSP is not "better optimized" than NEON. vDSP on iOS uses the NEON unit. vDSP's use of NEON is much better optimized than your use of NEON.

I haven't dug through your intrinsics code yet, but the most likely (in fact almost certain) cause of trouble is that you're creating wait states. Writing in assembler (and intrinsics are just assembler written with welding gloves on) is nothing like writing in C. You don't loop the same way. You don't compare the same way. You need a new way of thinking. In assembly you can do more than one thing at a time (because you have different logic units), but you absolutely have to schedule things so that all of them can run in parallel. Good assembly keeps all those pipelines full. If you can read your code and it makes perfect sense, it's probably crap assembly code. If you never repeat yourself, it's probably crap assembly code. You need to consider carefully what goes into which register and how many cycles must pass before you're allowed to read it.

If it were as easy as transliterating C, then the compiler would do that for you. The moment you say "I'm going to write this in NEON" you're saying "I think I can write better NEON than the compiler," because the compiler uses it too. That said, it often is possible to write better NEON than the compiler (particularly gcc and clang).


If you're ready to go diving into that world (and it's a pretty cool world), you have some reading ahead of you. Here are some places I recommend:

http://www.coranac/tonc/text/asm.htm (You really want to spend some time with this one.)

http://hilbert-space.de/ (The whole site. He stopped writing way too soon.)

To your specific question, he explains it here: http://hilbert-space.de/?p=22

http://robnapier/blog/fast-bezier-intro-701

http://robnapier/blog/faster-bezier-722

ALL THAT SAID... Always always always start by reconsidering your algorithm. Often the answer is not how to make your loop calculate quickly, it's how to not call the loop so often.
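As one illustration of rethinking the algorithm rather than tuning the loop, the four compare/select pairs per register in the question could be replaced by plain arithmetic: scale by 1/25, add 0.5, convert to integer, multiply by 25. This is only a sketch under stated assumptions, not a measured result. `bucket25` is a hypothetical name, inputs are assumed non-negative and finite, and multiplying by a reciprocal may differ from a true division for values sitting exactly on a midpoint. The NEON path is compiled only where NEON is available, with a scalar loop as the fallback and tail.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: writes the nearest multiple of 25 for each input.
   Assumes in[i] >= 0 and finite, so adding 0.5 and truncating rounds to nearest. */
static void bucket25(const float *in, int32_t *out, size_t n) {
    const float inv_step = 1.0f / 25.0f;
    size_t i = 0;
#if HAVE_NEON
    const float32x4_t half = vdupq_n_f32(0.5f);
    /* Two independent 4-lane streams per iteration, so neither waits on the other. */
    for (; i + 8 <= n; i += 8) {
        float32x4_t a = vld1q_f32(in + i);
        float32x4_t b = vld1q_f32(in + i + 4);
        a = vaddq_f32(vmulq_n_f32(a, inv_step), half);
        b = vaddq_f32(vmulq_n_f32(b, inv_step), half);
        int32x4_t qa = vcvtq_s32_f32(a); /* converts with truncation toward zero */
        int32x4_t qb = vcvtq_s32_f32(b);
        vst1q_s32(out + i, vmulq_n_s32(qa, 25));
        vst1q_s32(out + i + 4, vmulq_n_s32(qb, 25));
    }
#endif
    /* Scalar fallback, also used for any leftover elements. */
    for (; i < n; ++i) {
        out[i] = (int32_t)(in[i] * inv_step + 0.5f) * 25;
    }
}
```

This also writes integers directly, which removes the separate float-to-int pass. Whether it actually beats vDSP on a given device would need to be measured.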

