Is CUDA pinned memory zero-copy?

Problem Description


Pinned memory is supposed to increase transfer rates from host to device (API reference). However, I found that I do not need to call cuMemcpyHtoD for the kernel to access the values, or cuMemcpyDtoH for the host to read the values back. I didn't think this would work, but it does:

__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue + 10;
}

void test()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    // Allocate room for THREADS doubles (the size must be in bytes)
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocDefault);

    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<<numBlocks, threadsPerBlock>>>(pinnedHostPtr);

    //read output
    printf("Data after kernel execution: ");
    for (int i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");
}

Output:

Data after kernel execution: 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000
Thread id: 0, memory content: 0.000000
Thread id: 1, memory content: 1.000000
Thread id: 2, memory content: 2.000000
Thread id: 3, memory content: 3.000000
Thread id: 4, memory content: 4.000000
Thread id: 5, memory content: 5.000000
Thread id: 6, memory content: 6.000000
Thread id: 7, memory content: 7.000000



My questions are:

  • Is pinned memory zero-copy? I thought only mapped pinned memory was zero-copy.
  • If it is zero-copy, why is there an explicit way to map it to the device (cudaHostAlloc with the cudaHostAllocMapped option)?


I'm using CUDA Toolkit 5.5, a Quadro 4000 with the driver set to TCC mode, and the compilation options sm_20,compute_20.

Recommended Answer


Congratulations! You're encountering a compute capability 2.x + TCC + 64-bit OS feature, available with newer CUDA versions :)



First, a small theory summary, as CUDA taught us:


  • Pinned memory is not zero-copy: the GPU cannot access it (it is not mapped into the GPU's address space), and it is used to transfer data efficiently from the host to the GPU. It is page-locked memory (a valuable kernel resource) and has some performance advantages over normal pageable memory.
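For contrast, here is a minimal sketch of the conventional pinned-memory pattern, where the page-locked buffer only speeds up the explicit copies; the kernel name and buffer size are illustrative assumptions, not from the original post:

```cuda
#include <cstdio>

__global__ void scaleKernel(double *devMem, int n)
{
    int i = threadIdx.x;
    if (i < n)
        devMem[i] *= 2.0;  // double each element on the device
}

int main()
{
    const int N = 8;
    double *hostPtr, *devPtr;

    // Page-locked host buffer (faster DMA transfers) plus a separate device buffer
    cudaHostAlloc((void **)&hostPtr, N * sizeof(double), cudaHostAllocDefault);
    cudaMalloc((void **)&devPtr, N * sizeof(double));

    for (int i = 0; i < N; ++i)
        hostPtr[i] = i;

    // Explicit transfers: pinning makes these faster, but they are still copies
    cudaMemcpy(devPtr, hostPtr, N * sizeof(double), cudaMemcpyHostToDevice);
    scaleKernel<<<1, N>>>(devPtr, N);
    cudaMemcpy(hostPtr, devPtr, N * sizeof(double), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; ++i)
        printf("%f ", hostPtr[i]);
    printf("\n");

    cudaFree(devPtr);
    cudaFreeHost(hostPtr);
    return 0;
}
```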


Pinned zero-copy memory is page-locked memory (usually allocated with the cudaHostAllocMapped flag) which the GPU can also use directly, since it is mapped into the GPU's address space.
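A minimal sketch of the explicitly mapped (zero-copy) variant, where the kernel dereferences a device pointer that aliases the host allocation, so no cudaMemcpy is needed; the kernel name and sizes are assumptions for illustration:

```cuda
#include <cstdio>

__global__ void addTen(double *mem)
{
    mem[threadIdx.x] += 10.0;  // accesses go over the bus to host memory
}

int main()
{
    const int N = 8;
    double *hostPtr, *devPtr;

    cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede context creation
    cudaHostAlloc((void **)&hostPtr, N * sizeof(double), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);  // device-side alias

    for (int i = 0; i < N; ++i)
        hostPtr[i] = i;

    addTen<<<1, N>>>(devPtr);
    cudaDeviceSynchronize();  // make the kernel's writes visible to the host

    for (int i = 0; i < N; ++i)
        printf("%f ", hostPtr[i]);
    printf("\n");

    cudaFreeHost(hostPtr);
    return 0;
}
```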


Why can you access memory allocated on the host from the device without explicitly specifying it?


Take a look at the release notes for CUDA 4.0 (and higher):


    Devices supporting 64-bit addressing and compute capability 2.0 and higher now share a single unified address space between the host and all devices. This means that the pointer used to access memory on the host is the same as the pointer used to access memory on the device. Therefore, the location of memory may be queried directly from its pointer value; the direction of a memory copy need not be specified.


    To summarize: if your card has compute capability 2.0+ (and it does: developer.nvidia/cuda-gpus), you are running a 64-bit OS, and on Windows you have TCC mode enabled, you're automatically using UVA (Unified Virtual Addressing) between host and device. That means your code is automatically enhanced with zero-copy-like accesses.


    This is also in the CUDA documentation for the current version, in the paragraph "Automatic Mapping of Host Allocated Host Memory".


    Should this be advertised more? Probably yes, not everyone reads the release notes (not even me) :)
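The unified-addressing behavior quoted above can be illustrated with a small sketch (buffer sizes are hypothetical): under UVA the runtime can report where a pointer lives via cudaPointerGetAttributes, and cudaMemcpy accepts cudaMemcpyDefault so the copy direction is inferred from the pointers alone:

```cuda
#include <cstdio>

int main()
{
    const int N = 8;
    double *pinnedPtr, *devPtr;
    cudaHostAlloc((void **)&pinnedPtr, N * sizeof(double), cudaHostAllocDefault);
    cudaMalloc((void **)&devPtr, N * sizeof(double));

    // Query the location of an allocation directly from its pointer value.
    // On CUDA 5.5 the pinned host allocation reports cudaMemoryTypeHost;
    // devPtr would report cudaMemoryTypeDevice.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, pinnedPtr);
    printf("pinnedPtr is host memory: %d\n", attr.memoryType == cudaMemoryTypeHost);

    // The direction need not be specified: the runtime works it out under UVA.
    cudaMemcpy(devPtr, pinnedPtr, N * sizeof(double), cudaMemcpyDefault);

    cudaFree(devPtr);
    cudaFreeHost(pinnedPtr);
    return 0;
}
```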
