
Real-time writes to disk

I have a thread that needs to write data from an in-memory buffer to a disk thousands of times. I have requirements on how long each write can take, because the buffer needs to be cleared so that a separate thread can write into it again.

I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.

In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and found that some writes are taking very long. My block of code looks like this (it is in C, by the way):

last = get_timestamp();
write();
now = get_timestamp();
if (longest_write < now - last)
  longest_write = now - last;

And at the end I print out the longest write. I found that for a 32K buffer, I am seeing a longest write time of about 47 ms. This is way too long to meet the requirements of my application. I don't think this can be solely attributed to the rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write speeds? Thanks.
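For reference, a self-contained version of this kind of timing loop might look like the sketch below. The device path /dev/sdX, the 32K block size, the iteration count, and the use of clock_gettime() are illustrative assumptions, not the exact code in question.

#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 32768         /* 32K writes, as in the test above */

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void)
{
    /* /dev/sdX is a placeholder for the raw device being tested.
     * Warning: writing raw blocks destroys whatever is on the device. */
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* O_DIRECT requires an aligned buffer; 4096 covers most devices. */
    if (posix_memalign(&buf, 4096, BLOCK_SIZE)) return 1;
    memset(buf, 0, BLOCK_SIZE);

    double longest_write = 0.0;
    for (int i = 0; i < 10000; i++) {
        double last = now_ms();
        if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE) { perror("write"); break; }
        double delta = now_ms() - last;
        if (delta > longest_write)
            longest_write = delta;
    }
    printf("longest write: %.3f ms\n", longest_write);

    free(buf);
    close(fd);
    return 0;
}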

Edit: I am in fact using multiple buffers of the type I describe above and striping writes between them to multiple disks. One solution to my problem would be to simply increase the number of buffers to amortize the cost of long writes. However, I would like to keep the amount of memory used for buffering as small as possible, to avoid dirtying the cache of the thread that produces the data written into the buffer. My question should be constrained to dealing with the variance in the latency of writing a small block to disk and how to reduce it.

I'm assuming that you are using an ATA or SATA drive connected to the built-in disk controller in a standard computer. Is this a valid assumption, or are you using anything out of the ordinary (hardware RAID controller, SCSI drives, external drive, etc)?

As an engineer who does a lot of disk I/O performance testing at work, I would say that this sounds a lot like your writes are being cached somewhere. Your "high latency" I/O is a result of that cache finally being flushed. Even without a filesystem, I/O operations can be cached in the I/O controller or in the disk itself.

To get a better view of what is going on, record not just your max latency, but your average latency as well. Consider recording your 10-15 highest latency samples so you can get a better picture of how (in)frequent these high-latency samples are. Also, throw out the data recorded in the first two or three seconds of your test and start your data logging after that. There can be high-latency I/O operations at the start of a disk test that aren't indicative of the disk's true performance (they can be caused by things like the disk having to spin up to full speed, the head having to do a large initial seek, the disk write cache being flushed, etc).
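As an illustration, keeping the N highest samples (plus a running average) can be as simple as the following sketch; the sample count and the helper name are arbitrary choices, not part of any particular tool.

#define N_WORST 15                 /* keep the 15 highest-latency samples */

static double worst[N_WORST];      /* kept sorted; worst[0] is the largest */
static double sum_ms;              /* for the average: sum_ms / count */
static long   count;

/* Record one latency sample (in milliseconds). */
static void record_latency(double ms)
{
    sum_ms += ms;
    count++;
    for (int i = 0; i < N_WORST; i++) {
        if (ms > worst[i]) {
            for (int j = N_WORST - 1; j > i; j--)   /* shift smaller entries down */
                worst[j] = worst[j - 1];
            worst[i] = ms;
            break;
        }
    }
}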

If you want to benchmark disk I/O performance, I would recommend using a tool like IOMeter instead of using dd or rolling your own. IOMeter makes it easy to see what kind of a difference it makes to change the I/O size, alignment, etc, plus it keeps track of a number of useful statistics.

Requiring an I/O operation to happen within a certain amount of time is a risky thing to do. For one, other applications on the system can compete with you for disk access or CPU time, and it is nearly impossible to predict their exact effect on your I/O speeds. Your disk might encounter a bad block, in which case it has to do some extra work to remap the affected sectors before processing your I/O. This introduces an unpredictable delay. You also can't control what the OS, driver, and disk controller are doing. Your I/O request may get backed up in one of those layers for any number of unforeseeable reasons.

If the only reason you have a hard limit on I/O time is that your buffer is being re-used, consider changing your algorithm instead. Try using a circular buffer so that you can flush data out of it while writing into it. If you see that you are filling it faster than flushing it, you can throttle back your buffer usage. Alternatively, you can create multiple buffers and cycle through them. When one buffer fills up, write that buffer to disk and switch to the next one. You can be filling the new buffer even while the first is still being written to disk.
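A minimal sketch of the "multiple buffers, cycle through them" idea is shown below; the buffer count, buffer size, and the inline pwrite() flush are assumptions for illustration (in a real design the flush would be handed off to a separate I/O thread).

#include <string.h>
#include <unistd.h>

#define NUM_BUFS 4
#define BUF_SIZE 32768

static char   bufs[NUM_BUFS][BUF_SIZE];
static size_t fill;      /* bytes in the buffer currently being filled */
static int    cur;       /* index of the buffer currently being filled */

/* Called by the producer; flushes a full buffer and moves on to the next. */
static void append(int fd, const char *data, size_t len, off_t *offset)
{
    while (len > 0) {
        size_t space = BUF_SIZE - fill;
        size_t n = len < space ? len : space;
        memcpy(bufs[cur] + fill, data, n);
        fill += n; data += n; len -= n;

        if (fill == BUF_SIZE) {
            /* Done inline here to keep the sketch short; a separate writer
             * thread would let the producer keep filling the next buffer. */
            if (pwrite(fd, bufs[cur], BUF_SIZE, *offset) != BUF_SIZE) {
                /* handle short write / error */
            }
            *offset += BUF_SIZE;
            cur = (cur + 1) % NUM_BUFS;
            fill = 0;
        }
    }
}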

Response to comment: You can't really "get the kernel out of the way"; it's the lowest level in the system and you have to go through it to one degree or another. You might be able to build a custom version of the driver for your disk controller (provided it's open source) and build in a "high-priority" I/O path for your application to use. You are still at the mercy of the disk controller's firmware and the firmware/hardware of the drive itself, which you can't necessarily predict or do anything about.

Hard drives traditionally perform best when doing large, sequential I/O operations. Drivers, device firmware, and OS I/O subsystems take this into account and try to group smaller I/O requests together so that they only have to generate a single, large I/O request to the drive. If you are only flushing 32K at a time, then your writes are probably being cached at some level, coalesced, and sent to the drive all at once. By defeating this coalescing, you should reduce the number of I/O latency "spikes" and see more uniform disk access times. However, these access times will be much closer to the large times seen in your "spikes" than to the moderate times you are normally seeing. The latency spike corresponds to an I/O request that didn't get coalesced with any others and thus had to absorb the entire overhead of a disk seek. Request coalescing is done for a reason; by bundling requests you are amortizing the overhead of a drive seek operation over multiple commands. Defeating coalescing leads to doing more seek operations than you normally would, giving you overall slower I/O speeds. It's a trade-off: you reduce your average I/O latency at the expense of occasionally having an abnormal, high-latency operation. It is a beneficial trade-off, however, because the increase in average latency associated with disabling coalescing is nearly always more of a disadvantage than the more consistent access time is an advantage.

I'm also assuming that you have already tried adjusting thread priorities, and that this isn't a case of your high-bandwidth producer thread starving out the buffer-flushing thread for CPU time. Have you confirmed this?

You say that you do not want to disturb the high-bandwidth thread that is also running on the system. Have you actually tested various output buffer sizes/quantities and measured their impact on the other thread? If so, please share some of the results you measured so that we have more information to use when brainstorming.

Given the amount of memory that most machines have, moving from a 32K buffer to a system that rotates through 4 32K buffers is a rather inconsequential jump in memory usage. On a system with 1GB of memory, the increase in buffer size represents only 0.0092% of the system's memory. Try moving to a system of alternating/rotating buffers (to keep it simple, start with 2) and measure the impact on your high-bandwidth thread. I'm betting that the extra 32K of memory isn't going to have any sort of noticeable impact on the other thread. This shouldn't be "dirtying the cache" of the producer thread. If you are constantly using these memory regions, they should always be marked as "in use" and should never get swapped out of physical memory. The buffer being flushed must stay in physical memory for DMA to work, and the second buffer will be in memory because your producer thread is currently writing to it. It is true that using an additional buffer will reduce the total amount of physical memory available to the producer thread (albeit only very slightly), but if you are running an application that requires high bandwidth and low latency then you would have designed your system such that it has quite a lot more than 32K of memory to spare.

Instead of solving the problem by trying to force the hardware and low-level software to perform to specific performance measurements, the easier solution is to adjust your software to fit the hardware. If you measure your max write latency to be 1 second (for the sake of nice round numbers), write your program such that a buffer that is flushed to disk will not need to be re-used for at least 2.5-3 seconds. That way you cover your worst-case scenario, plus provide a safety margin in case something really unexpected happens. If you use a system where you rotate through 3-4 output buffers, you shouldn't have to worry about re-using a buffer before it gets flushed. You aren't going to be able to control the hardware too closely, and if you are already writing to a raw volume (no filesystem) then there's not much between you and the hardware that you can manipulate or eliminate. If your program design is inflexible and you are seeing unacceptable latency spikes, you can always try a faster drive. Solid-state drives don't have to "seek" to do I/O operations, so you should see a fairly uniform hardware I/O latency.

As long as you are using O_DIRECT | O_SYNC, you can use ioprio_set() to set the IO scheduling priority of your process/thread (although the man page says "process", I believe you can pass a TID as given by gettid()).

If you set a real-time IO class, then your IO will always be given first access to the disk - it sounds like this is what you want.
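A sketch of what that might look like on Linux follows. The ioprio constants mirror the kernel's include/linux/ioprio.h, glibc provides no wrapper for ioprio_set() so it goes through syscall(), and the real-time class normally requires CAP_SYS_ADMIN.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the kernel's include/linux/ioprio.h */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_CLASS_RT    1
#define IOPRIO_WHO_PROCESS 1
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
    pid_t tid = syscall(SYS_gettid);   /* TID of the calling thread */

    /* Real-time I/O class, highest priority level (0) within that class. */
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, tid,
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0)) < 0) {
        perror("ioprio_set");
        return 1;
    }
    return 0;
}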

I have a thread that needs to write data from an in-memory buffer to a disk thousands of times.

I have tested the disk with dd. I'm not using any filesystem on it and writing directly to the disk (opening it with the direct flag). I am able to get about 100 MB/s with a 32K block size.

dd's block size is aligned with the file system block size. I guess your log file's isn't.

Plus, your application probably not only writes the log file but also does some other file operations. Or your application isn't the only one using the disk.

Generally, disk I/O isn't optimized for latency; it is optimized for throughput. High latencies are normal - and networked file systems have even higher ones.

In my application, I noticed I wasn't able to write data to the disk at nearly this speed. So I looked into what was happening and found that some writes are taking very long.

Some writes take longer because at some point you saturate the write queue and the OS finally decides to actually flush the data to disk. The I/O queues are configured to be fairly short by default, to avoid excessive buffering and information loss due to a crash.

NB: If you want to see the real speed, try setting the O_DSYNC flag when opening the file.

If your blocks are really aligned, you might try using the O_DIRECT flag, since that would remove contention (with other applications) at the Linux disk cache level. The writes would then work at the real speed of the disk.
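A hedged sketch of opening with those flags and allocating a suitably aligned buffer (the helper names, the 4096-byte alignment, and the file mode are assumptions for illustration):

#define _GNU_SOURCE                 /* O_DIRECT is Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int open_log_direct(const char *path)   /* path is whatever file/device you write to */
{
    /* With O_DSYNC, write() returns only after the data reaches stable storage. */
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0)
        perror("open");
    return fd;
}

/* O_DIRECT needs the buffer (and the I/O size/offset) aligned to the device's
 * logical block size; 4096 bytes is a safe choice on most disks. */
void *alloc_io_buffer(size_t size)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, size) != 0)
        return NULL;
    return buf;
}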

100 MB/s with dd - without any syncing - is a highly synthetic benchmark, as you never know whether the data has really hit the disk. Try adding oflag=dsync (or conv=fdatasync) to dd's command line.

Also try using a larger block size. 32K is still small. IIRC, 128K was the optimal size when I was testing sequential vs. random I/O a few years ago.

I am seeing a longest write time of about 47 ms.

"Real time" != "fast". “实时”!=“快速”。 If I define max response time of 50ms, and your app consistently responds within the 50ms (47 < 50) then your app would classify as real-time.如果我将最大响应时间定义为 50 毫秒,并且您的应用始终在 50 毫秒内响应(47 < 50),那么您的应用将被归类为实时。

I don't think this can be solely attributed to the rotational latency of the disk. Any ideas what is going on and what I can do to get more stable write speeds?

I do not think you can avoid the write() delays. Latencies are an inherent property of disk I/O. You can't avoid them - you have to expect and handle them.

I can think of only the following option: use two buffers. The first would be used by write(); the second would store new incoming data from the threads. When write() finishes, switch the buffers and, if there is something to write, start writing it. That way there is always a buffer for the threads to put their data into. Overflow might still happen if the threads generate data faster than write() can write it. Dynamically adding more buffers (up to some limit) might help in that case.
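A rough sketch of that two-buffer scheme with a dedicated writer thread is given below; the synchronization layout, sizes, and names are assumptions, and error/overflow handling is omitted for brevity.

#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 32768

static char   bufs[2][BUF_SIZE];
static size_t lens[2];                 /* bytes currently in each buffer */
static int    fill_idx;                /* buffer the producer writes into */
static int    done;                    /* set at shutdown (not shown) */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  data_ready = PTHREAD_COND_INITIALIZER;

/* Producer: append data to the buffer not currently being flushed. */
void produce(const char *data, size_t n)
{
    pthread_mutex_lock(&lock);
    memcpy(bufs[fill_idx] + lens[fill_idx], data, n);   /* assumes it fits; overflow handling would go here */
    lens[fill_idx] += n;
    pthread_cond_signal(&data_ready);
    pthread_mutex_unlock(&lock);
}

/* Writer thread: swap buffers, then write the full one outside the lock. */
void *writer(void *arg)
{
    int fd = *(int *)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (lens[fill_idx] == 0 && !done)
            pthread_cond_wait(&data_ready, &lock);
        if (done && lens[fill_idx] == 0) { pthread_mutex_unlock(&lock); break; }
        int write_idx = fill_idx;
        fill_idx = 1 - fill_idx;       /* producer moves on to the other buffer */
        pthread_mutex_unlock(&lock);

        /* The slow write happens while the lock is not held. */
        if (write(fd, bufs[write_idx], lens[write_idx]) < 0) {
            /* handle error */
        }
        lens[write_idx] = 0;
    }
    return NULL;
}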

Otherwise, you can achieve some sort of real-time-ness for (rotational) disk I/O only if your application is the sole user of the disk. (The old rule of real-time applications applies: there can be only one.) O_DIRECT helps somewhat to remove the influence of the OS itself from the equation. (Though you would still have some file system overhead in the form of occasional delays due to block allocation when the file is extended. Under Linux that works pretty fast, but it can still be avoided by preallocating the whole file in advance, e.g. by writing zeros.) If the timing is really important, consider buying a dedicated disk for the job. SSDs have excellent throughput and do not suffer from seeking.
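If the log does go through a filesystem, preallocation might look like the following sketch (the path and size are placeholders; posix_fallocate() falls back to writing zeros on filesystems without a native preallocation primitive):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("logfile", O_WRONLY | O_CREAT, 0644);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 1 GiB up front so later writes don't stall on block allocation. */
    int err = posix_fallocate(fd, 0, 1024L * 1024 * 1024);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

    close(fd);
    return 0;
}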

Are you writing to a new file or overwriting the same file?

The big difference with dd is likely to be seek time. dd is streaming to a (mostly) contiguous list of blocks; if you are writing lots of small files, the head may be seeking all over the drive to allocate them.

The best way of solving the problem is likely to be removing the requirement for the log to be written within a specific time. Can you use a set of buffers so that one is being written (or at least sent to the drive's buffer) while new log data is arriving into another one?

Linux doesn't write anything directly to disk; it uses virtual memory, and then a kernel thread called pdflush writes that data to disk. The behavior of pdflush can be controlled via sysctl -w "".
