
CPU bound applications vs. IO bound

Original post: 2009-10-26 05:36:51. Tags: language-agnostic, data-processing

For 'number-crunching' style applications that use a lot of data (read: "hundreds of MB, but not into GB", i.e. it will fit nicely into memory beside the OS), does it make sense to read all your data into memory before starting processing? The aim would be to avoid making the program IO bound while it reads large related datasets, by loading them from RAM instead.

Does this answer change between different data backings? I.e., would the answer be the same irrespective of whether you were using XML files, flat files, a full DBMS, etc.?

2 answers

Your program is as fast as whatever its bottleneck is. It makes sense to do things like storing your data in memory if that improves the overall performance. There is no hard and fast rule that says it will improve performance, however. When you fix one bottleneck, something new becomes the bottleneck. So resolving one issue may get you a 1% increase in performance or a 1000% increase, depending on what the next bottleneck is. The thing you're improving may even still be the bottleneck afterwards.
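One way to find out which side is the bottleneck is simply to time the read phase and the compute phase separately. The following is only a rough sketch: `load`, `crunch` and the scratch file are made-up stand-ins for a real dataset and real processing, and a freshly written file is probably served from the OS cache, so the read time here will look better than a cold read from disk.

```python
import os
import tempfile
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def load(path):
    # I/O phase: pull the whole file into memory in one go.
    with open(path, "rb") as f:
        return f.read()

def crunch(data):
    # CPU phase: a trivial stand-in for the real number crunching.
    return sum(data) % 256

# Hypothetical input: a 1 MB scratch file created just for this demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    path = tmp.name

data, io_seconds = timed(load, path)
checksum, cpu_seconds = timed(crunch, data)
os.remove(path)

print(f"read: {io_seconds:.4f}s  compute: {cpu_seconds:.4f}s")
```

If the compute time dwarfs the read time, preloading everything into RAM is unlikely to buy much.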

I think about these things as generally fitting into one of three levels:

Eager. When you need something from disk or from a network, or the result of a calculation, you go and get it or do it. This is the simplest to program and the easiest to test and debug, but the worst for performance. This is fine so long as this aspect isn't the bottleneck;
Lazy. Once you've done a particular read or calculation, don't do it again for some period of time, which may be anything from a few milliseconds to forever. This can add a lot of complexity to your program, but if the read or calculation is expensive it can reap enormous benefits; and
Over-eager. This is much like a combination of the previous two. Results are cached, but instead of doing the read or calculation only when requested, there is a certain amount of preemptive activity to anticipate what you might want. For example, if you read 10K from a file, there is a reasonably high likelihood that you will later want the next 10K block. Rather than delay execution, you fetch it just in case it's requested.
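The three levels above can be sketched roughly as follows. This is a toy illustration, not a recommended design: `STORAGE`, `read_block` and the tiny block size are invented for the demo, and the `reads` list only exists to show how often the simulated "disk" is actually touched.

```python
from functools import lru_cache

BLOCK = 4                     # tiny block size so the demo is easy to follow
STORAGE = bytes(range(32))    # stand-in for a file on disk
reads = []                    # log of block indexes fetched from "disk"

def read_block(index):
    """Simulated expensive read of one block."""
    reads.append(index)
    return STORAGE[index * BLOCK:(index + 1) * BLOCK]

# Eager: go and fetch every time it is asked for.
def eager(index):
    return read_block(index)

# Lazy: fetch once, then serve repeats from a cache.
@lru_cache(maxsize=None)
def lazy(index):
    return read_block(index)

# Over-eager: cache, and also prefetch the following block.
_cache = {}
def over_eager(index):
    for i in (index, index + 1):
        if i not in _cache and i * BLOCK < len(STORAGE):
            _cache[i] = read_block(i)
    return _cache[index]

def reads_for(fn, indexes):
    """Run fn over indexes and return which blocks hit the 'disk'."""
    reads.clear()
    for i in indexes:
        fn(i)
    return list(reads)

eager_reads = reads_for(eager, [0, 0])       # both calls hit the disk
lazy_reads = reads_for(lazy, [0, 0])         # the second call is a cache hit
over_reads = reads_for(over_eager, [0, 1])   # block 1 is already there when asked for
print(eager_reads, lazy_reads, over_reads)
```

Note how the lazy and over-eager versions need extra state (and, in real code, an invalidation policy), which is where the added complexity comes from.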

The lesson to take from this is the (somewhat over-used and often misquoted) line from Donald Knuth that "premature optimization is the root of all evil." Lazy and over-eager solutions add a huge amount of complexity, so there is no point building them for something that won't yield a useful benefit.

Programmers often make the mistake of creating some highly (allegedly) optimized version of something before determining whether they need it and whether it will be useful.

My own take on this is: don't solve a problem until you have a problem.

I would guess that choosing the right data storage method will have more effect than whether you read from disk all at once or as needed.

Most database tables have regular offsets for fields in each row. For example, a customer record may be 50 bytes long and have a pants_size column starting at the 12th byte. Selecting all pants sizes is as easy as getting the values at offsets 12, 62, 112, 162, ad nauseam.
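As a minimal sketch of that offset arithmetic: the record layout below (12 bytes of other fields, a one-byte pants_size, then padding up to 50 bytes) is an assumption made up for the demo, not how any particular database stores rows.

```python
import struct

RECORD_SIZE = 50    # bytes per customer record, as in the example above
PANTS_OFFSET = 12   # where pants_size starts inside each record

def make_record(pants_size):
    # Hypothetical layout: 12 bytes of other fields, 1-byte pants_size, 37 pad bytes.
    return struct.pack("<12sB37x", b"customer", pants_size)

def pants_sizes(table):
    """Pick pants_size out of every record using offset arithmetic alone."""
    return [table[pos] for pos in range(PANTS_OFFSET, len(table), RECORD_SIZE)]

table = b"".join(make_record(size) for size in (30, 32, 34, 36))
offsets = list(range(PANTS_OFFSET, len(table), RECORD_SIZE))
print(offsets)
print(pants_sizes(table))
```

No parsing is involved: every value is found by jumping straight to a computed position.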

XML, however, is a lousy format for fast data access. You'll need to slog through a bunch of variable-length tags and attributes in order to get your data, and you won't be able to jump instantly from one record to the next, unless you parse the file into a data structure like the one mentioned above. In which case you'd have something very much like an RDBMS, so there you go.
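To illustrate the "parse it into a data structure first" route, here is a small sketch using Python's standard `xml.etree.ElementTree`. The document, its element names and the `load_pants_sizes` helper are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are made up for the demo.
XML = """<customers>
  <customer name="Ann"><pants_size>30</pants_size></customer>
  <customer name="Bob"><pants_size>34</pants_size></customer>
</customers>"""

def load_pants_sizes(xml_text):
    """Pay the parsing cost once, keeping only the needed values in a plain list."""
    root = ET.fromstring(xml_text)
    return [int(c.findtext("pants_size")) for c in root.iter("customer")]

sizes = load_pants_sizes(XML)
print(sizes[1])   # direct index access, no tag scanning needed any more
```

After the one-off parse, record N is just `sizes[N]`, which is the random access that raw XML cannot offer.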