Depth transformation with ARM NEON intrinsics

Updated: 2024-10-28 06:31:40

I'm trying to wrap my head around NEON intrinsics, and figured I could start with an example and ask some questions.

In this experiment I want to convert 32-bit RGB to 16-bit BGR. What would be a good start in converting the following code to use NEON intrinsics? The problem I'm having is that a single 16-bit value doesn't match any intrinsic type that I can see. There's 16x4, 16x8, etc., but I'm having little luck wrapping my head around how I need to approach this. Any tips?

Here's the code I'm trying to convert.

typedef struct {
    uint16_t b:5, g:6, r:5;
} _color16;

static int depth_transform_32_to_16_c (VisVideo *dest, VisVideo *src)
{
    int x, y;
    int w;
    int h;

    _color16 *dbuf = visual_video_get_pixels (dest);
    uint8_t *sbuf = visual_video_get_pixels (src);

    int ddiff;
    int sdiff;

    depth_transform_get_smallest (dest, src, &w, &h);

    /* Padding at the end of each row: destination in pixels, source in bytes. */
    ddiff = (dest->pitch / dest->bpp) - w;
    sdiff = src->pitch - (w * src->bpp);

    for (y = 0; y < h; y++) {
        for (x = 0; x < w; x++) {
            /* Keep the top 5/6/5 bits of the first three source bytes. */
            dbuf->b = *(sbuf++) >> 3;
            dbuf->g = *(sbuf++) >> 2;
            dbuf->r = *(sbuf++) >> 3;

            dbuf++;
            sbuf++; /* skip the unused fourth byte */
        }

        dbuf += ddiff;
        sbuf += sdiff;
    }

    return VISUAL_OK;
}

Edit: oh, for some reason I was looking at this considering 16x3 bits, but we're looking at 5,6,5 = 16bits. I realize I need shifts. Hmm.

Best answer

NEON uses 128-bit wide registers, so conceptually what you want to do is read in four pixels of 32-bit RGB, use bitwise operations on those, and eventually write out your 16-bit pixels. One observation is that for best performance you may want to combine two 128-bit inputs (eight 32-bit pixels) and produce one 128-bit output. This will make your memory accesses more efficient.

Another way to think about this is that you are taking your inner loop content and doing four pixels in parallel. The reason it's a little hard to work with your original code is that some of the "magic" is hidden because you're using bit fields. If you rewrote your C code to work from 32 bits to 16 bits using shifts, ANDs, and ORs, that code would translate to SIMD more naturally, and you could visualize how you'd work with multiple data within that context.

If you just look at the transformation of a single 32-bit pixel into a 16-bit pixel:

00000000RRRRRRRRGGGGGGGGBBBBBBBB
0000000000000000BBBBBGGGGGGRRRRR

This can help you visualize what you need to do in parallel for four pixels: shift, extract, and combine. You can think about this as four 32-bit lanes, though for some of the bit operations the lane width doesn't matter (e.g. OR-ing a register as four 32-bit lanes or as eight 16-bit lanes gives the same result).
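The shift/extract/combine idea can be sketched for a single pixel in plain C. This is only an illustration that follows the bit layout shown above (red ends up in the low 5 bits, blue in the high 5); the name `pack_pixel_565` is made up for this example. Note that the order of bit fields inside `_color16` is implementation-defined, so check where your compiler actually places `b` and `r` before assuming this matches the original struct bit for bit.

```c
#include <stdint.h>

/* Pack one pixel laid out as 00000000RRRRRRRRGGGGGGGGBBBBBBBB into
 * 16 bits laid out as BBBBBGGGGGGRRRRR (layout assumed from the diagram). */
static inline uint16_t pack_pixel_565(uint32_t px)
{
    uint32_t r = (px >> 19) & 0x001F; /* top 5 bits of red   -> bits 0..4   */
    uint32_t g = (px >> 5)  & 0x07E0; /* top 6 bits of green -> bits 5..10  */
    uint32_t b = (px << 8)  & 0xF800; /* top 5 bits of blue  -> bits 11..15 */
    return (uint16_t)(r | g | b);
}
```

Each channel needs exactly one shift and one mask, which is the shape that maps onto vector shift, AND, and OR instructions.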

Rough pseudo-code:

1. Read (vector load) one 128-bit register = four 32-bit pixels.
2. Shift green (all four components) into the right bit position.
3. Mask out green (using an AND mask) into another register. (Conceptually still in 4x32-bit "mode".)
4. Shift red (all four components) into the right bit position.
5. Mask out red into yet another register.
6. Shift blue into the right bit position.
7. Mask out blue into another register.
8. With red and blue shifted to their final bit positions, use bitwise OR to combine the three channels.
9. Now you'll have four 16-bit values with 32-bit alignment. (All so far still conceptually 4x32-bit.)
10. Repeat with another set of four pixels.
11. Combine those two sets with a NEON unzip (VUZP) to produce one 128-bit / 8-pixel register.
12. Write (vector store) those pixels.
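As a rough sketch, the steps above might look like this with intrinsics from `arm_neon.h`. Several things here are assumptions rather than part of the original question: the source row is treated as an array of `uint32_t` in the 0x00RRGGBB layout, the target is little-endian (so the even 16-bit lanes are the low halves of the 32-bit lanes), and the helper names `pack4_565` and `convert_row_565` are hypothetical. A scalar loop handles leftover pixels and also serves as the fallback when NEON is not available.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Pack four pixels; each 16-bit result sits in the low half of its 32-bit lane. */
static inline uint32x4_t pack4_565(uint32x4_t px)
{
    uint32x4_t r = vandq_u32(vshrq_n_u32(px, 19), vdupq_n_u32(0x001F));
    uint32x4_t g = vandq_u32(vshrq_n_u32(px, 5),  vdupq_n_u32(0x07E0));
    uint32x4_t b = vandq_u32(vshlq_n_u32(px, 8),  vdupq_n_u32(0xF800));
    return vorrq_u32(vorrq_u32(r, g), b);
}
#endif

/* Convert n pixels from 0x00RRGGBB to BBBBBGGGGGGRRRRR (assumed layouts). */
static void convert_row_565(uint16_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Eight pixels per iteration: two 128-bit loads, one 128-bit store. */
    for (; i + 8 <= n; i += 8) {
        uint32x4_t lo = pack4_565(vld1q_u32(src + i));
        uint32x4_t hi = pack4_565(vld1q_u32(src + i + 4));

        /* Unzip: val[0] collects the even 16-bit lanes of both inputs,
         * which are the packed pixels on a little-endian target. */
        uint16x8x2_t uz = vuzpq_u16(vreinterpretq_u16_u32(lo),
                                    vreinterpretq_u16_u32(hi));
        vst1q_u16(dst + i, uz.val[0]);
    }
#endif

    /* Scalar tail (and fallback when NEON is not available). */
    for (; i < n; i++) {
        uint32_t px = src[i];
        dst[i] = (uint16_t)(((px >> 19) & 0x001F) |
                            ((px >> 5)  & 0x07E0) |
                            ((px << 8)  & 0xF800));
    }
}
```

To fit this into the original function, you would call something like it once per row and keep the existing pitch arithmetic for moving between rows. Narrowing each half with `vmovn_u32` and joining the halves with `vcombine_u16` is another way to do the final step; the unzip is used here only because the pseudo-code names it.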

Published: 2023-08-06 21:21:00
Link: https://www.elefans.com/category/jswz/34/1455641.html