```cuda
__device__ void do_sth(char *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        a[idx] = a[idx];
    }
}

__global__ void print(char *a, int N)
{
    // question_1: why is there access to N? It is now in GPU memory -- how?
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    //do_sth<<<nblock2,blocksize2>>>(a,N); // error_1: a host function call cannot be configured
    //do_sth(&&a,N);                       // error_2: expected an expression
    if (idx < N) {
        a[idx] = a[idx];
    }
}
```

How should I access the `do_sth` function from the `print` function (see the code)? And why is the variable/constant `N` (see the code) visible to the GPU without using `cudaMemcpy`?
Accepted answer:
A `__global__` function (aka "kernel") already resides on the GPU. All of its parameters (the variables `a` and `N`) are passed through shared or constant memory (depending on your device type) at call time, so you can access those variables directly. There is a limit on total parameter size: 256 B on pre-Fermi cards and 4 KB on Fermi, so if you have big chunks of data to transfer, you cannot avoid the `cudaMemcpy` functions.
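To illustrate the point above, here is a minimal sketch (hypothetical names, assumes a CUDA-capable device): the scalar `N` travels as an ordinary by-value kernel argument and needs no explicit copy, while the array's contents must still be moved with `cudaMemcpy`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(char *a, int N)   // N arrives by value; no cudaMemcpy needed for it
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = 'x';
}

int main()
{
    const int N = 64;
    char host[N];
    char *dev;
    cudaMalloc(&dev, N);               // the array's storage lives on the device
    fill<<<1, 64>>>(dev, N);           // both arguments (a pointer and an int) are passed by value
    cudaMemcpy(host, dev, N, cudaMemcpyDeviceToHost);  // bulk data still requires cudaMemcpy
    printf("host[0] = %c\n", host[0]);
    cudaFree(dev);
    return 0;
}
```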
The parameters of a `__global__` function should not be modified.
When calling a `__device__` function from a `__global__` function, you do not specify configuration parameters in triple angle brackets. The `__device__` function will be executed by every thread that reaches the call inside the kernel. Note that you can place the call inside an `if` statement to prevent some threads from executing it.
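Applied to the code in the question, a sketch of the corrected call: the `__device__` function is invoked like an ordinary function, with no `<<<...>>>` configuration and with the pointer `a` passed as-is rather than `&&a`:

```cuda
__device__ void do_sth(char *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx];
}

__global__ void print(char *a, int N)
{
    // A __device__ function is called like a plain function:
    // no launch configuration, pointer passed directly.
    do_sth(a, N);   // executed by every thread of this kernel launch
}
```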
In the current version of CUDA, it is impossible to spawn more threads during kernel execution.
There is no unary `&&` operator in CUDA C++ (there is no such operator in regular C++ either; I am not sure about that now that the new standard is emerging).