

Hard disk contention using multiple threads

I have not done any profiling of this yet, but what would the general consensus be on the advantages/disadvantages of loading resources from the hard disk using multiple threads vs one thread? Note: I am not talking about the main thread.

I would have thought that using more than one "other" thread to do the loading would be pointless, because the HD cannot do two things at once and so it would surely only cause disk contention.

I'm not sure which way to go architecturally; any advice is appreciated.

EDIT: Apologies, I meant an SSD drive, not a magnetic drive. Both are "HDs" to me, but I am more interested in the case of a system with a single SSD drive.

As pointed out in the comments, one advantage of using multiple threads is that loading a large file will not delay the delivery of a smaller one to the consumer of the loader threads. In my case that is a big advantage, so even if it costs a little performance, having multiple threads is desirable.

I know there are no simple answers, but the real question I am asking is: what kind of percentage performance penalty would there be for issuing parallel disk requests that get serialised at the OS layer, as opposed to allowing only one resource-loader thread? And what are the factors that drive this? I don't mean platform, manufacturer and so on; I mean, technically, what aspects of the OS/drive interaction influence this penalty (in theory)?

FURTHER EDIT: My exact use case is texture-loading threads, which exist only to load from the drive and then "pass" the data on to OpenGL, so there is minimal computation in the threads (maybe some type conversion etc.). In this case each thread would spend most of its time waiting on the drive (I would have thought), and therefore understanding how the OS-drive interaction is managed is important. My OS is Windows 10.
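To illustrate the setup being described (this is a sketch of my assumption of its shape, not code from the question), here is a minimal C++17 arrangement: a couple of hypothetical worker threads read raw file bytes and hand them over a queue to the thread that owns the GL context, which is where the actual glTexImage2D-style upload would have to happen. File names and the worker count are placeholders.

```cpp
// Minimal sketch of the loader-thread arrangement described above.
// Assumptions (not from the question): C++17, placeholder file names,
// and the GL upload left as a comment because it must run on the GL thread.
#include <condition_variable>
#include <fstream>
#include <iostream>
#include <iterator>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct LoadedTexture {
    std::string path;
    std::vector<char> bytes;   // raw file contents; decoding/type conversion omitted
};

// A very small blocking queue shared between the loader threads and the GL thread.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    std::optional<T> pop() {            // returns std::nullopt once closed and drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front()); q_.pop();
        return item;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

int main() {
    BlockingQueue<std::string> work;       // file paths waiting to be read
    BlockingQueue<LoadedTexture> results;  // loaded bytes waiting for the GL thread

    // Hypothetical texture list; the worker count is the knob to benchmark.
    for (const char* p : {"albedo.png", "normal.png", "roughness.png"}) work.push(p);
    work.close();

    const unsigned workerCount = 2;        // try 1, 2, 4, ... and measure
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < workerCount; ++i) {
        workers.emplace_back([&] {
            while (auto path = work.pop()) {
                std::ifstream f(*path, std::ios::binary);
                std::vector<char> bytes((std::istreambuf_iterator<char>(f)),
                                        std::istreambuf_iterator<char>());
                results.push({*path, std::move(bytes)});   // thread mostly waits on the drive
            }
        });
    }

    // In the real application this loop runs on the thread that owns the GL context,
    // calling e.g. glTexImage2D with the decoded pixels.
    for (size_t n = 0; n < 3; ++n) {       // one pop per file pushed above
        if (auto tex = results.pop())
            std::cout << tex->path << ": " << tex->bytes.size() << " bytes loaded\n";
    }
    for (auto& t : workers) t.join();
}
```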

Note: I am not talking about the main thread.

Main vs non-main thread makes zero difference to the speed of reading a disk.

I would have thought that using more than one "other" thread to do the loading would be pointless, because the HD cannot do two things at once and so it would surely only cause disk contention.

Indeed. Not only are the attempted parallel reads forced to wait for each other (and thus not actually parallel), but they also make the disk's access pattern random rather than sequential, which is much, much slower due to disk head seek time.

Of course, if you were dealing with multiple hard disks, then one thread dedicated to each drive would probably be optimal.


Now, if you were using a solid-state drive instead of a hard drive, the situation isn't quite so clear-cut. Multiple threads may be faster, slower, or comparable. There are probably many factors involved, such as firmware, file system, operating system, the speed of the drive relative to some other bottleneck, and so on.


In either case, RAID might invalidate assumptions made here.

A lot of people will tell you that an HD can't do more than one thing at once. This isn't quite true, because modern IO systems have a lot of indirection. Saturating them is difficult to do with one thread.

Here are three scenarios that I have experienced where multi-threading the IO helps.

  1. Sometimes the IO reading library does a non-trivial amount of computation; think of reading compressed video, or parity checking after the transfer has happened. One example is using robocopy with multiple threads. It's not unusual to launch robocopy with 128 threads (its /MT flag accepts up to 128)!

  2. Many operating systems are designed so that a single process can't saturate the IO, because that would lead to system unresponsiveness. In one case I got a 3% read-speed improvement because I came closer to saturating the IO. This is doubly true if some system policy exists to stripe the data across different drives, as might be set on a Lustre file system in an HPC cluster. For my application, the optimal number of threads was two.

  3. More complicated IO, like a RAID card, contains a substantial cache that keeps the HD head constantly reading and writing. To get optimal throughput you need to be sure that whenever the head is spinning, it is constantly reading/writing and not just moving. In practice, the only way to do this is to saturate the card's on-board RAM.

So, many times you can overlap some minor amount of computation by using multiple threads, and stuff starts getting tricky with larger disk arrays.

I'm not sure which way to go architecturally; any advice is appreciated.

Determining the amount of work per thread is the most common architectural optimization. Write your code so that it is easy to increase the number of IO workers. You're going to need to benchmark.
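As a hedged sketch of what "easy to increase the IO worker count" might look like, the following harness (my own, not from the answer) makes the worker count the only knob and times the same read workload at several settings. The file names are placeholders.

```cpp
// Sketch of a worker-count sweep, assuming the files to load are known up front.
// Nothing here is specific to textures; it just times N concurrent readers.
#include <atomic>
#include <chrono>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Read every path whose index i satisfies i % stride == offset; returns bytes read.
static size_t readSlice(const std::vector<std::string>& paths, size_t offset, size_t stride) {
    size_t total = 0;
    for (size_t i = offset; i < paths.size(); i += stride) {
        std::ifstream f(paths[i], std::ios::binary);
        std::vector<char> buf((std::istreambuf_iterator<char>(f)),
                              std::istreambuf_iterator<char>());
        total += buf.size();
    }
    return total;
}

int main() {
    // Hypothetical file list; in practice this would be your real asset set,
    // large enough that OS file-cache effects don't dominate the measurement.
    std::vector<std::string> paths = {"tex0.bin", "tex1.bin", "tex2.bin", "tex3.bin"};

    for (unsigned workers : {1u, 2u, 4u, 8u}) {
        std::atomic<size_t> bytes{0};
        auto start = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w)
            pool.emplace_back([&, w] { bytes += readSlice(paths, w, workers); });
        for (auto& t : pool) t.join();

        std::chrono::duration<double> secs = std::chrono::steady_clock::now() - start;
        std::cout << workers << " worker(s): "
                  << (bytes.load() / (1024.0 * 1024.0)) / secs.count() << " MiB/s\n";
    }
}
```

Note that repeated runs over the same files will largely measure the OS file cache rather than the SSD, so either use a data set bigger than RAM or flush the cache between runs.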

It depends on how much processing of the data you're going to do. This will determine whether the application is I/O bound or compute bound.

For example, if all you are going to do to the data is some simple arithmetic, e.g. add 1, then you will end up being I/O bound. The CPU can add 1 to data far quicker than any I/O system can deliver flows of data.

However, if you're going to do a large amount of work on each batch of data, e.g. an FFT, then a filter, then a convolution (I'm picking random DSP routine names here), then it's likely that you will end up being compute bound; the CPU cannot keep up with the data being delivered by the I/O subsystem that owns your SSD.
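To make the "add 1" example concrete, here is a small sketch (mine, with an arbitrary 256 MiB buffer) that measures how quickly the CPU can walk memory adding 1; comparing the printed figure with your drive's sequential read speed shows which side is the bottleneck.

```cpp
// Rough illustration of the "add 1" case: measure how fast the CPU can walk a
// buffer adding 1, and compare that figure with your drive's sequential read speed.
// The buffer size is an arbitrary assumption; pick something well beyond L3.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    const size_t n = 256u * 1024u * 1024u;      // 256 MiB of bytes
    std::vector<uint8_t> data(n, 0);

    auto start = std::chrono::steady_clock::now();
    for (auto& b : data) b += 1;                // the trivial "processing"
    std::chrono::duration<double> secs = std::chrono::steady_clock::now() - start;

    double gib_per_s = (n / (1024.0 * 1024.0 * 1024.0)) / secs.count();
    std::cout << "add-1 pass: " << gib_per_s << " GiB/s\n";
    // Typical single-SSD sequential reads are on the order of 0.5 to 7 GB/s, so if
    // the number above is much larger, this workload is I/O bound, as argued above.
    std::cout << "checksum: " << std::accumulate(data.begin(), data.end(), 0u) << "\n";
}
```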

It is quite an art to judge how an algorithm should be structured to match the underlying capabilities of the machine, and vice versa. There are profiling tools like ftrace/KernelShark and Intel's VTune, both of which are useful in analysing exactly what is going on. Google does a lot to measure how many searches-per-Watt their hardware accomplishes, power being their biggest cost.

In general, I/O of any sort, even a big array of SSDs, is painfully slow. Even the main memory in a PC (DDR4) is painfully slow in comparison to what the CPU can consume. Even the L3 and L2 caches are sluggards in comparison to the CPU cores. It's hard to design and multi-thread an algorithm just right, so that the right amount of work is done on each data item whilst it is in L1 cache, and so that the L2 and L3 caches, DDR4 and I/O subsystem can deliver the next data item to the L1 cache just in time to keep the CPU cores busy. And the ideal software design for one machine is likely hopeless on another with a different CPU, SSD or memory DIMMs. Intel designs for good general-purpose computer performance, and actually extracting peak performance from a single program is a real challenge. Libraries like Intel's MKL and IPP are a very big help in doing this.

General Guidance

In general, one should look at it in terms of the data bandwidth required by any particular arrangement of threads and the work those threads are doing.

This means benchmarking your program's inner processing loop and measuring how much data it processed and how quickly it managed to do so, choosing a number of data items that makes sense but is much larger than the L3 cache. A single 'data item' is an amount of input data, the amount of corresponding output data, and any variables used in processing the input to the output, whose total size fits in L1 cache (with some room to spare). And no cheating: use the CPU's SSE/AVX instructions where appropriate; don't forgo them by writing plain C or by not using something like Intel's IPP/MKL. [Though if one is using IPP/MKL, it kind of does all this for you to the best of its ability.]
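A minimal sketch of such a benchmarking loop, under assumed parameters: a placeholder per-item kernel, an 8 KiB "data item" so that input plus output sit comfortably in L1, and a total working set far larger than a typical L3. Substitute your real processing routine for the placeholder.

```cpp
// Hedged sketch of the benchmarking loop described above. The "data item" here is
// an arbitrary 8 KiB chunk (assumption: input + output fit in L1 with room to
// spare), and the total working set (~512 MiB of RAM) is far larger than L3.
#include <chrono>
#include <iostream>
#include <vector>

// Stand-in for the real per-item kernel (FFT, filter, convolution, ...).
static void processItem(const float* in, float* out, size_t count) {
    for (size_t i = 0; i < count; ++i)
        out[i] = in[i] * 0.5f + 1.0f;           // placeholder work
}

int main() {
    const size_t itemFloats = 8 * 1024 / sizeof(float);   // ~8 KiB per item
    const size_t items = 32 * 1024;                       // ~256 MiB of input, >> L3
    std::vector<float> in(itemFloats * items, 1.0f), out(in.size());

    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < items; ++i)
        processItem(&in[i * itemFloats], &out[i * itemFloats], itemFloats);
    std::chrono::duration<double> secs = std::chrono::steady_clock::now() - start;

    // Count bytes in + bytes out, as the answer suggests.
    double bytes = 2.0 * in.size() * sizeof(float);
    std::cout << "inner loop bandwidth: "
              << (bytes / (1024.0 * 1024.0 * 1024.0)) / secs.count() << " GiB/s\n";
    std::cout << "sink: " << out.back() << "\n";   // keep the result observable
}
```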

These days DDR4 memory is going to be good for anything between 20 and 100 GByte/second (depending on the CPU, number of DIMMs, etc.), so long as you're not making random, scattered accesses to the data. By saturating the L3 you are forcing yourself to be bound by the DDR4 speed. Then you can start changing your code, increasing the work done by each thread on a single data item. Keep increasing the work per item and the speed will eventually start increasing; you've reached the point where you are no longer limited by the speed of DDR4, then L3, then L2.

If after this you can still see ways of increasing the work per data item, then keep going. You eventually get to a data bandwidth somewhere near that of the IO subsystems, and only then will you be getting the absolute most out of the machine.

It's an iterative process, and experience allows one to short-cut it.

Of course, if one runs out of ideas for increasing the work done per data item, then that's the end of the design process. More performance can be achieved only by improving the bandwidth of whatever has ended up being the bottleneck (almost certainly the SSD).

For those of us who like doing this sort of thing, the PS3's Cell processor was a dream. No need to second-guess the cache; there was none. One had complete control over what data and code was where, and when it was there.
