CUDA and addressing bits in parallel

I want to write a CUDA program that returns the locations in a larger array that meet a specific criterion.

The trivial way to do it is to write a kernel that returns an array of integers, with 1 where the criterion was met and 0 where it was not.

Another way might be to return only the indexes that were found - but based on my knowledge of GPU synchronization that would be problematic (it is equivalent to implementing a queue/linked list on the GPU).
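
For reference, a minimal sketch of such an index queue built around an atomically incremented counter (hypothetical names, not code from the question; note the order of the collected indexes is nondeterministic):

__global__ void collect_indexes(const unsigned char *found, int *out,
                                int *count, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    if (found[i]) {
        // atomicAdd reserves a unique slot in the output array
        int slot = atomicAdd(count, 1);
        out[slot] = i;
    }
}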

The problem with the first idea is that the result array would be the same size as the input.

Another way I thought about is to create an array of n/8 + 1 bytes (n = number of items I check) and use 1 bit for each array location, holding a sort of compressed representation of the output.
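
For reference, the addressing arithmetic for such a bitmap is just a byte index plus a bit offset (a host-side sketch with hypothetical helper names; doing the same store from many GPU threads at once is exactly the synchronization question addressed in the answer below):

// n elements need n/8 + 1 bytes; element i lives in byte i/8, bit i%8
static void set_bit(unsigned char *bitmap, int i) {
    bitmap[i / 8] |= (unsigned char)(1u << (i % 8));
}

static int get_bit(const unsigned char *bitmap, int i) {
    return (bitmap[i / 8] >> (i % 8)) & 1;
}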

The only thing I could not find out is whether CUDA supports bit addressing in parallel.

An example of how I am doing it now:

__global__ void test_kernel(char *gpuText, char *gpuFind, int *gputTextSize,
                            int *gputSearchSize, int *resultsGPU)
{
    // one thread per candidate start position in the text
    int start_idx = threadIdx.x + (blockIdx.x * blockDim.x);
    if (start_idx > *gputTextSize - *gputSearchSize) return;
    unsigned int wrong = 0;
    for (int i = 0; i < *gputSearchSize; i++) {
        // accumulate mismatches across the search pattern
        wrong |= calculationOnGpu(gpuText, gpuFind, start_idx, i, gputSearchSize);
    }
    resultsGPU[start_idx] = !wrong;  // 1 where the criterion holds
}

What I want to do is use something more compact than int or char for the "resultsGPU" array.

Thanks

Best answer

A CUDA GPU can access items on boundaries of 1, 2, 4, 8, or 16 bytes. It does not have the ability to independently access bits in a byte.

Bits in a byte would be modified by reading a larger item, such as a char or int, modifying the bit(s) in a register, and then writing that item back to memory. Thus it would be a read-modify-write operation.

In order to preserve adjacent bits in such a scenario with multiple threads, it would be necessary to atomically update the item (char, int, etc.). There are no atomics that operate on char quantities, so the bits would need to be grouped into quantities of 32 and written e.g. as int. Following that idiom, every thread would be doing an atomic operation.
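
A minimal sketch of that atomic idiom (hypothetical names, separate from the kernel above): each thread ORs its single result bit into the 32-bit word that holds it.

__global__ void set_result_bits(const unsigned char *found,
                                unsigned int *bitmap, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= n) return;
    if (found[i])
        // atomicOr makes the read-modify-write of the shared 32-bit word
        // safe against the neighbouring threads updating the same word
        atomicOr(&bitmap[i / 32], 1u << (i % 32));
}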

32 also happens to be the warp size currently, so a warp-based intrinsic might be a more efficient way to go here, in particular the warp vote __ballot() function. Something like this:

__global__ void test_kernel(char *gpuText, char *gpuFind, int *gputTextSize,
                            int *gputSearchSize, int *resultsGPU)
{
    int start_idx = threadIdx.x + (blockIdx.x * blockDim.x);
    if (start_idx > *gputTextSize - *gputSearchSize) return;
    unsigned int wrong = 0;
    wrong = calculationOnGpu(gpuText, gpuFind, start_idx, 0, gputSearchSize);
    // collect one result bit from each of the 32 threads in the warp
    wrong = __ballot(wrong);
    // lane 0 of each warp writes the packed 32-bit word for the whole warp
    if ((threadIdx.x & 31) == 0)
        resultsGPU[start_idx / 32] = wrong;
}

You haven't provided complete code, so the above is just a sketch of how it might be done. I'm not sure the loop in your original kernel was an efficient approach anyway, and the above assumes 1 thread per data item to be searched. __ballot() should be safe even in the presence of inactive threads at one end or the other of the array being searched.
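
One caveat for newer toolkits: since CUDA 9 the unqualified __ballot() intrinsic is deprecated in favor of __ballot_sync(), which takes an explicit mask of participating lanes. The equivalent line in the sketch above would be something like:

// CUDA 9+ replacement for the deprecated __ballot();
// __activemask() reports the lanes that reached this point together
wrong = __ballot_sync(__activemask(), wrong);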
