Optimizing member variable order in C++

Updated: 2024-10-15 10:20:26

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to

"re-order the member variables of a class into most used and least used."

I'm not familiar with C++, nor with how it compiles, but I was wondering:

1. Is this statement accurate?
2. How/why?
3. Does it apply to other (compiled/scripting) languages?

I'm aware that the amount of (CPU) time saved by this trick would be minimal; it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.

Best answer

Two issues here:

1. Whether and when keeping certain fields together is an optimization.
2. How to actually do it.

The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.

The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.

The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.
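As a rough illustration of grouping commonly-used fields, here is a minimal sketch. The `Particle` type, its fields, and the 64-byte cache line size are all assumptions made up for the example; none of them come from the blog post, and the real line size depends on the processor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is common but not universal.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical game entity: fields touched every frame come first,
// rarely used fields follow.
struct Particle {
    // Hot: read and written in the per-frame update loop.
    float x, y, z;
    float vx, vy, vz;
    std::uint32_t flags;
    // Cold: only needed for debugging or on spawn/despawn.
    char debug_name[64];
    std::uint64_t spawn_time;
};

// True if the hot fields end within the first kAssumedCacheLine bytes of the
// object. That is one cache line only if the object itself starts on a
// cache-line boundary.
constexpr bool hot_fields_fit_one_line() {
    return offsetof(Particle, flags) + sizeof(std::uint32_t) <= kAssumedCacheLine;
}

// Per-frame update that touches only the hot position and velocity fields.
inline void update(Particle& p, float dt) {
    assert(hot_fields_fit_one_line());
    p.x += p.vx * dt;
    p.y += p.vy * dt;
    p.z += p.vz * dt;
}
```

Whether this layout is actually faster than the alternatives is something only a profiler can tell you; the sketch only shows where the fields end up.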

Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x, now C++11's std::atomic, generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised, because they all lock the same cache line even though they're operating on different fields. Had they been operating on different cache lines, they would have worked in parallel and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations (the effect is commonly called false sharing), and it can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on the grounds that cache misses are usually a source of lost speed, and still be horribly wrong in your particular case.
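To make the gotcha concrete, here is a small sketch of the two layouts. It only inspects where the fields land; it does not measure contention. The struct names, the `byte_gap` helper, and the 64-byte line size are assumptions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Assumed cache line size; check your target processor.
constexpr std::size_t kAssumedCacheLine = 64;

// Two counters packed next to each other: they very likely share a cache line.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter is aligned to the start of its own assumed cache line.
struct PaddedCounters {
    alignas(kAssumedCacheLine) std::atomic<std::uint64_t> a{0};
    alignas(kAssumedCacheLine) std::atomic<std::uint64_t> b{0};
};

// Distance in bytes between the two counters of an object.
template <typename Counters>
std::ptrdiff_t byte_gap(const Counters& c) {
    return reinterpret_cast<const char*>(&c.b) -
           reinterpret_cast<const char*>(&c.a);
}

// Prints the gap for both layouts and checks that the padded one
// separates the counters by at least one assumed cache line.
inline void report_gaps() {
    PackedCounters packed;
    PaddedCounters padded;
    std::printf("packed gap: %td bytes\n", byte_gap(packed));
    std::printf("padded gap: %td bytes\n", byte_gap(padded));
    assert(byte_gap(padded) >= static_cast<std::ptrdiff_t>(kAssumedCacheLine));
}
```

The padded layout trades memory for independence: each counter now costs a whole cache line. That is usually only worth it for fields that different threads really do write concurrently.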

Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order in which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:

- an int (4 bytes, 4-aligned)
- followed by a char (1 byte, any alignment)
- followed by an int (4 bytes, 4-aligned)
- followed by a char (1 byte, any alignment)

then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.

In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
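The two orderings can be compared directly with `sizeof`. This is a minimal sketch; the struct names `Interleaved` and `Grouped` and the `print_layouts` helper are made up for the example. On a typical platform with a 4-byte, 4-aligned int I would expect it to report 16 and 12 bytes, but the numbers are whatever your compiler and target decide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// int, char, int, char: padding is needed before the second int and at the end.
struct Interleaved {
    int a;
    char b;
    int c;
    char d;
};

// int, int, char, char: the chars pack together, leaving only tail padding.
struct Grouped {
    int a;
    int c;
    char b;
    char d;
};

// Prints size and alignment of both layouts, and checks that grouping
// the ints first never makes the object larger.
inline void print_layouts() {
    std::printf("Interleaved: size %zu, align %zu\n",
                sizeof(Interleaved), alignof(Interleaved));
    std::printf("Grouped:     size %zu, align %zu\n",
                sizeof(Grouped), alignof(Grouped));
    assert(sizeof(Grouped) <= sizeof(Interleaved));
}
```

Calling `print_layouts()` from `main` on each target platform is a cheap way to see how much padding a given field order costs there.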

I said that whether int is 4-aligned is platform-dependent: on some ARM processors (older cores in particular) it absolutely has to be, since unaligned access raises a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.

The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next biggest, and so on down to members with no alignment requirement. For example, if I'm trying to write portable code I might come up with this:

    #include <cstdint>

    struct some_stuff {
        double d;         // I expect double is 64bit IEEE, it might not be
        std::uint64_t l;  // 8 bytes, could be 8-aligned or 4-aligned, I don't know
        std::uint32_t i;  // 4 bytes, usually 4-aligned
        std::int32_t j;   // same
        short s;          // usually 2 bytes, could be 2-aligned or unaligned, I don't know
        char c[4];        // array 4 chars, 4 bytes big but "never" needs 4-alignment
        char e;           // 1 byte, any alignment
    };

If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
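Rather than guessing, you can also ask the compiler with `alignof` (C++11). The sketch below uses a variant of the struct above, with the final char named `e` so that all member names are unique. It checks two things: whether the members really are declared in descending alignment order on the current platform, and how many padding bytes the struct ends up with. The helper names `sorted_by_alignment` and `padding_bytes` are my own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct some_stuff {
    double d;
    std::uint64_t l;
    std::uint32_t i;
    std::int32_t j;
    short s;
    char c[4];
    char e;
};

// Alignment of each member, in declaration order.
constexpr std::size_t kMemberAlign[] = {
    alignof(double), alignof(std::uint64_t), alignof(std::uint32_t),
    alignof(std::int32_t), alignof(short), alignof(char[4]), alignof(char)};

// Sum of the member sizes, i.e. the size the struct would have with no padding.
constexpr std::size_t kPayload =
    sizeof(double) + sizeof(std::uint64_t) + sizeof(std::uint32_t) +
    sizeof(std::int32_t) + sizeof(short) + sizeof(char[4]) + sizeof(char);

// True if no member has a stricter alignment than the one declared before it.
inline bool sorted_by_alignment() {
    const std::size_t n = sizeof(kMemberAlign) / sizeof(kMemberAlign[0]);
    for (std::size_t k = 1; k < n; ++k) {
        if (kMemberAlign[k] > kMemberAlign[k - 1]) return false;
    }
    return true;
}

// Total padding the compiler added, between members or at the end.
inline std::size_t padding_bytes() {
    assert(sizeof(some_stuff) >= kPayload);
    return sizeof(some_stuff) - kPayload;
}
```

When the members are sorted this way, the only padding left should be at the end of the struct, so `padding_bytes()` stays below `alignof(some_stuff)`.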

Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.

Published: 2023-07-27 20:20:00
Article link: https://www.elefans.com/category/jswz/34/1294993.html