Efficiency of branching in shaders

I understand that this question may seem somewhat ungrounded, but if someone knows anything theoretical or has practical experience on this topic, it would be great if you shared it.

I am attempting to optimize one of my old shaders, which uses a lot of texture lookups.

I've got diffuse, normal, and specular maps for each of the three possible mapping planes, and for some faces that are near to the user I also have to apply mapping techniques such as parallax occlusion mapping, which bring in a lot of extra texture lookups.
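
To give a rough sense of scale, the per-pixel lookup count ends up something like the simplified sketch below (a minimal HLSL sketch with placeholder names, not the actual shader):

    // Simplified sketch of the per-pixel lookup count (placeholder names).
    Texture2D    DiffuseMapX;
    Texture2D    DiffuseMapY;
    Texture2D    DiffuseMapZ;
    SamplerState LinearSampler;

    float4 TriplanarDiffuse(float3 worldPos, float3 blendWeights)
    {
        float2 uvX = worldPos.zy;
        float2 uvY = worldPos.xz;
        float2 uvZ = worldPos.xy;

        // Three lookups just for the diffuse colour of one fragment...
        float4 diffuse = DiffuseMapX.Sample(LinearSampler, uvX) * blendWeights.x
                       + DiffuseMapY.Sample(LinearSampler, uvY) * blendWeights.y
                       + DiffuseMapZ.Sample(LinearSampler, uvZ) * blendWeights.z;

        // ...and the same again for the normal and specular maps, so nine
        // lookups per pixel before parallax occlusion mapping even starts.
        return diffuse;
    }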

Profiling showed that texture lookups are the bottleneck of the shader, and I am willing to remove some of them. For certain input parameters I already know that part of the texture lookups would be unnecessary, so the obvious solution is to do something like this (pseudocode):

    if (part_actually_needed)
    {
        perform lookups;
        perform other steps specific for THIS PART;
    }

    // All other parts.
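
And here is a minimal HLSL sketch of how I picture that branch (placeholder names and values, with a fixed mip level just to keep the sketch simple - not the real shader):

    // Minimal sketch of the if-based skipping (placeholder names and values).
    Texture2D    DiffuseMap;
    Texture2D    HeightMap;
    SamplerState LinearSampler;

    float4 ShadeFace(float2 uv, bool part_actually_needed)
    {
        float4 result = float4(0.5, 0.5, 0.5, 1.0);   // cheap fallback path

        if (part_actually_needed)
        {
            // Expensive path: the extra lookups (a stand-in for the real POM code).
            // Fixed mip level keeps the sketch simple.
            float  height    = HeightMap.SampleLevel(LinearSampler, uv, 0).r;
            float2 shiftedUV = uv + height * 0.04;    // placeholder offset
            result = DiffuseMap.SampleLevel(LinearSampler, shiftedUV, 0);
        }

        return result;
    }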

Now - here comes the question.

I do not remember exactly (that's why I said the question might be ungrounded), but in a paper I read recently (unfortunately, I can't remember the name), something similar to the following was stated:

The performance of the presented technique depends on how efficiently the HARDWARE-BASED CONDITIONAL BRANCHING is implemented.

I remembered that kind of statement right before I was about to start refactoring a large number of shaders and implementing the if-based optimization I was talking about.

So - right before I start doing that - does anyone know anything about the efficiency of branching in shaders? Why can branching give a severe performance penalty in shaders?

And is it even possible that the if-based branching could actually make performance worse?


You might say - try and see. Yes, that's what I'm going to do if nobody here helps me :)

But still, an if-based approach that is effective on new GPUs could be a nightmare on somewhat older ones, and that kind of issue is very hard to forecast unless you have a lot of different GPUs to test on (which isn't my case).

So, if anyone knows something about that or has benchmarking experience for these kinds of shaders, I would really appreciate your help.


The few remaining brain cells that are actually working keep telling me that branching on the GPU might be far less effective than branching on the CPU (which usually has extremely efficient branch prediction and ways of hiding cache misses), simply because it's a GPU - or because that kind of machinery could be hard or impossible to implement on a GPU.

Unfortunately I am not sure if this statement has anything in common with the real situation...

Accepted answer

Unfortunately, I think the real answer here is to do practical testing of your specific case with a performance analyser, on your target hardware - particularly given that it sounds like you're at the optimisation stage of the project. It's the only way to take into account the fact that hardware changes frequently, as well as the nature of your specific shader.

On a CPU, if you get a mispredicted branch, you'll cause a pipeline flush, and since CPU pipelines are so deep, you'll effectively lose something on the order of 20 or more cycles. On the GPU things are a little different; the pipelines are likely to be far shallower, but there's no branch prediction, and all of the shader code will be in fast memory - but that's not the real difference.

It's difficult to know the exact details of everything that's going on, because nVidia and ATI are relatively tight-lipped, but the key thing is that GPUs are made for massively parallel execution. There are many asynchronous shader cores, but each core is in turn designed to run multiple threads. My understanding is that each core expects to run the same instruction on all its threads on any given cycle (nVidia calls this collection of threads a "warp").

In this case, a thread might represent a vertex, a geometry element, or a pixel/fragment, and a warp is a collection of about 32 of those. For pixels, they're likely to be pixels that are close to each other on screen. The problem is that if, within one warp, different threads make different decisions at the conditional jump, the warp has diverged and is no longer running the same instruction for every thread. The hardware can handle this, but it's not entirely clear (to me, at least) how it does so, and it's also likely to be handled slightly differently for each successive generation of cards. The newest, most general, CUDA/compute-shader-friendly nVidia cards might have the best implementation; older cards might have a poorer one. The worst case is that you may find many threads executing both sides of the if/else statement, so that, roughly speaking, a diverged warp pays the cost of both sides added together.

One of the great tricks with shaders is learning how to leverage this massively parallel paradigm. Sometimes that means using extra passes, temporary offscreen buffers and stencil buffers to push logic up out of the shaders and onto the CPU. Sometimes an optimisation may appear to burn more cycles, but it could actually be reducing some hidden overhead.
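
A minimal sketch of what such a pre-pass might look like (placeholder names and threshold; it assumes the stencil write state is configured on the API side so that every surviving pixel gets marked):

    // Cheap pre-pass: mark only the "near" pixels in the stencil buffer, so a
    // later pass can run the expensive POM shader purely where the stencil
    // test passes, with no per-pixel 'if' in that shader at all.
    float4 MarkNearFacesPS(float viewDistance : TEXCOORD0) : SV_Target
    {
        clip(10.0 - viewDistance);        // placeholder distance threshold
        return float4(0.0, 0.0, 0.0, 0.0);
    }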

Also note that you can explicitly mark if statements in DirectX shaders as [branch] or [flatten]. The flatten style gives you the right result, but always executes all of the instructions. If you don't explicitly choose one, the compiler can choose one for you - and may pick [flatten], which is no good for your example.
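
For example, a minimal sketch (placeholder names and values; the attributes themselves are standard HLSL, but how much [branch] actually helps still depends on the hardware):

    // Minimal sketch of the two attributes (placeholder names and values).
    Texture2D    ColourMap;
    SamplerState LinearSampler;

    float4 AttributeDemoPS(float2 uv : TEXCOORD0, float fade : TEXCOORD1) : SV_Target
    {
        float4 colour = float4(0.2, 0.2, 0.2, 1.0);

        // [branch]: compile to a real jump, so the lookup is skipped when the
        // condition is false - what the question is hoping to get.
        [branch]
        if (fade > 0.5)
        {
            colour = ColourMap.SampleLevel(LinearSampler, uv, 0);
        }

        // [flatten]: execute the body unconditionally and select the result,
        // so its instructions are always paid for - which defeats the point here.
        [flatten]
        if (fade > 0.9)
        {
            colour *= 1.5;
        }

        return colour;
    }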

One thing to remember is that if you jump over the first texture lookup, this will confuse the hardware's texture coordinate derivative math. You'll get compiler errors, and it's best not to do so; otherwise you might miss out on some of the better texturing support.
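
A minimal sketch of one common workaround, assuming Shader Model 4+ HLSL (placeholder names): take the derivatives outside the branch, then sample with explicit gradients inside it, so the divergence no longer interferes with the derivative math.

    // Derivatives are computed in uniform control flow, then used via
    // SampleGrad inside the branch (placeholder names and values).
    Texture2D    DiffuseMap;
    SamplerState LinearSampler;

    float4 GradSafePS(float2 uv : TEXCOORD0, float blend : TEXCOORD1) : SV_Target
    {
        float2 dx = ddx(uv);
        float2 dy = ddy(uv);

        float4 colour = float4(0.5, 0.5, 0.5, 1.0);

        [branch]
        if (blend > 0.5)
        {
            // Explicit-gradient sampling is legal inside divergent flow control.
            colour = DiffuseMap.SampleGrad(LinearSampler, uv, dx, dy);
        }

        return colour;
    }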
