Modular Arithmetic on the GPU

Updated: 2024-10-28 02:31:10

I am working on a GPU algorithm that has to do a lot of modular computations. In particular, it performs various operations on matrices over a finite field which in the long run reduce to primitive operations such as (a*b - c*d) mod m or (a*b + c) mod m, where a, b, c and d are residues modulo m and m is a 32-bit prime.

Through experimentation I learned that the performance of the algorithm is mostly limited by slow modular arithmetic, because integer modulo (%) and division operations are not supported in hardware on the GPU.

I would appreciate it if somebody could give me an idea of how to implement efficient modular computations with CUDA.

To see how this is implemented on CUDA, I use the following code snippet:

__global__ void mod_kernel(unsigned *gout, const unsigned *gin) {

    unsigned tid = threadIdx.x;
    unsigned a = gin[tid], b = gin[tid * 2], m = gin[tid * 3];

    typedef unsigned long long u64;

    __syncthreads();
    unsigned r = (unsigned)(((u64)a * (u64)b) % m);
    __syncthreads();
    gout[tid] = r;
}

This code is not supposed to do anything useful; I just wanted to see how modular reduction is implemented on CUDA.

When I disassemble this with cuobjdump --dump-sass (thanks njuffa for the advice!), I see the following:

/*0098*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
/*00a0*/ /*0x1c315c4350000000*/ IMUL.U32.U32.HI R5, R3, R7;
/*00a8*/ /*0x1c311c0350000000*/ IMUL.U32.U32 R4, R3, R7;
/*00b0*/ /*0xfc01dde428000000*/ MOV R7, RZ;
/*00b8*/ /*0xe001000750000000*/ CAL 0xf8;
/*00c0*/ /*0x00000007d0000000*/ BPT.DRAIN 0x0;
/*00c8*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;

Note that between the two calls to bar.red.popc there is a call to the 0xf8 procedure, which implements some sophisticated algorithm (about 50 instructions or even more). No surprise that the mod (%) operation is slow.

Recommended answer

Some time ago I experimented a lot with modular arithmetic on the GPU. On Fermi GPUs you can use double-precision arithmetic to avoid the expensive div and mod operations. For example, modular multiplication can be done as follows:

// fast truncation of double-precision to integers
#define CUMP_D2I_TRUNC (double)(3ll << 51)

// computes r = a + b subop c unsigned using extended precision
#define VADDx(r, a, b, c, subop) \
    asm volatile("vadd.u32.u32.u32." subop " %0, %1, %2, %3;" : \
        "=r"(r) : "r"(a), "r"(b), "r"(c));

// computes a * b mod m; invk = (double)(1<<30) / m
__device__ __forceinline__
unsigned mul_m(unsigned a, unsigned b, volatile unsigned m, volatile double invk) {

    unsigned hi = __umulhi(a*2, b*2); // 3 flops
    // 2 double instructions
    double rf = __uint2double_rn(hi) * invk + CUMP_D2I_TRUNC;
    unsigned r = (unsigned)__double2loint(rf);
    r = a * b - r * m; // 2 flops
    // can also be replaced by: VADDx(r, r, m, r, "min") // == umin(r, r + m);
    if ((int)r < 0)
        r += m;
    return r;
}

However, this only works for 31-bit integer moduli (if losing 1 bit is not critical for you), and you also need to precompute 'invk' beforehand. This gives the absolute minimum number of instructions I could achieve, i.e.:

SHL.W R2, R4, 0x1;
SHL.W R8, R6, 0x1;
IMUL.U32.U32 R4, R4, R6;
IMUL.U32.U32.HI R8, R2, R8;
I2F.F64.U32 R8, R8;
DFMA R2, R2, R8, R10;
IMAD.U32.U32 R4, -R12, R2, R4;
ISETP.GE.AND P0, pt, R4, RZ, pt;
@!P0 IADD R4, R12, R4;

For a description of the algorithm, you can have a look at my paper: gpu_resultants. Other operations like (x*y - z*w) mod m are also explained there.

Out of curiosity, I compared the performance of the resultant algorithm using your modular multiplication:

unsigned r = (unsigned)(((u64)a * (u64)b) % m);

against the optimized version with mul_m.

Modular arithmetic with the default % operation:

low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495
res time: 5755.357910 ms; mod_inv time: 0.907008 ms; interp time: 856.015015 ms; CRA time: 44.065857 ms
GPU time elapsed: 6659.405273 ms;

Modular arithmetic with mul_m:

low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495
res time: 1100.124756 ms; mod_inv time: 0.192608 ms; interp time: 220.615143 ms; CRA time: 10.376352 ms
GPU time elapsed: 1334.742310 ms;

So on average it is about 5x faster. Note also that you might not see a speed-up if you just evaluate raw arithmetic performance using a kernel with a bunch of mul_mod operations (like a saxpy example). But in real applications with control logic, synchronization barriers, etc., the speed-up is very noticeable.

Published: 2023-06-01 10:31:40
Link: https://www.elefans.com/category/jswz/34/413030.html