How to design objects for performance

Question

Whilst reading a book about physics engine development recently, I came across a design decision that I had never even considered before. It relates to the way the CPU addresses the raw bytes in memory.

Consider the following class:

class Foo
{
    public:
        float x;
        float y;
        float z;

        /* Constructors and Methods */

    private:
        float padding;  // never used; only pads the object from 12 to 16 bytes
};

The author claims that the padding, which increases the size of the object to four 32-bit words (16 bytes) on x86, results in a noticeable performance benefit, because 4 words sit more cleanly in memory than 3. What does this mean? Padding out an object with redundant data to increase performance seems pretty paradoxical to me.

This also raises another question: what about objects that are 1 or 2 words in size? If my class is something like:

class Bar
{
    public:
        float x;
        float y;

        /* Constructors and Methods */

    private:
        /* padding ?? */
};

Should I add padding to this class so that it sits more cleanly in memory?

Answer 1

It is the compiler's responsibility to decide what padding is reasonable (assuming typical access patterns). The compiler knows a whole lot more about your machine than you ever will. Besides, your machine will be with you for a couple of years; the program will be around for decades, running on a wide range of platforms and subject to a mind-boggling variety of usage patterns. What is best for today's i7 could very well be worst for tomorrow's i8 or ARMv11.

Obfuscating code in pursuit of elusive "performance" falls squarely into premature optimization. Always remember that your time (writing, testing, debugging, and understanding the tweaked code again a week later) is much, much more expensive than the computer time you might waste, unless that code runs thousands of times a day on millions of machines. Code tweaking makes no sense at all until you have hard facts showing that performance isn't sufficient, and measurements telling you that shuffling that structure around is a bottleneck worth worrying about.

Answer 2

Processors don't "read" memory byte by byte as humans do; they process it chunk by chunk, in chunk sizes that vary from processor to processor. This is called memory access granularity.

By "memory aligning" your object, access time may be faster, and you can also avoid data fragmentation.

You can read more about data alignment here.

Edit: I'm not saying that it's good or bad practice, just sharing what I know about it.

Answer 3

There are two really important things to say in answer to this question.

First, if you're going to tweak code for performance benefits, and you've decided it's worthwhile (for whatever reason), you must first write a benchmark. You must be able to try both versions and measure the difference.

Second, tweaks of this kind depend on how the assembly language interacts with the hardware. You must be able to read assembly language code and understand the different instruction sets and hardware access modes in order to understand why these tweaks might work.

Finally, your question has no answer in isolation. It depends on whether those objects are allocated individually or in collections, whether there are other objects next to them, and how the compiler generates code for each case. In all likelihood, alignment on a power-of-two boundary will be faster than misalignment, but a collection that fits in a cache is faster than one that doesn't. I wouldn't expect 8 or 4 bytes of padding to improve performance, but if it were important, I would try it and test the result.


3 answers

Answer 1: 9 votes, accepted, 2014-03-07 13:27:24

Answer 2: 3 votes

Answer 3: 1 vote, 2014-03-07 13:37:48
