
C++: Reading from several sections of a file is too slow

I need to read byte arrays from several locations of a big file. I have already optimized the file so that as few sections as possible have to be read, and the sections are as close together as possible.

I have 20 calls like this one:

m_content.resize(iByteCount);
fseek(iReadFile, iStartPos, SEEK_SET);
size_t readElements = fread(&m_content[0], sizeof(unsigned char), iByteCount, iReadFile);

iByteCount is around 5000 on average.

Before using fread, I used a memory-mapped file, but the results were approximately the same.

My calls are still too slow (around 200 ms) when executed for the first time. When I repeat the same call with the same sections of bytes to read, it is very fast (around 1 ms), but that does not really help me.

The file is big (around 200 MB). After this call, I have to read double values from a different section of the file, but I cannot avoid this.

I don't want to split it up into two files. I have seen the "huge file approach" used by other people too, and they overcame this problem somehow.

If I use memory mapping, the first read from a section is always slow. If I then repeat reading from this section, it is lightning fast. When I then read from a different section, it is slow the first time, but lightning fast the second time.

I have no idea why this is so.

Does anybody have any more ideas for me? Thank you.

Disk drives have two (actually three) factors that limit their speed: access time, sequential bandwidth, and bus latency/bandwidth.

What you feel most is access time. Access time is typically in the millisecond ballpark. A seek takes upwards of 5 (often more than 10) milliseconds on a typical hard disk. Note that the number printed on a disk drive is the "average" time, not the worst time (and, in some cases, it seems to be much closer to "best" than "average").

Sequential read bandwidth is typically upwards of 60-80 MiB/s even for a slow disk, and 120-150 MiB/s for a faster disk (or >400 MiB/s on solid state). Bus bandwidth and latency are something you usually don't care about, as bus speed usually exceeds the drive speed (except if you use a modern solid-state disk on SATA-2, or a 15k hard disk on SATA-1, or any disk over USB).

Also note that you cannot change the drive's bandwidth, nor the bus bandwidth. Nor can you change the seek time. However, you can change the number of seeks.

In practice, this means you must avoid seeks as much as you can. If that means reading in data that you do not need, do not be afraid of doing so. It is much faster to read 100 kiB than to read 5 kiB, seek ahead 90 kiB, and read another 5 kiB.

If you can, read the whole file in one go, and only use the parts you are interested in. 200 MiB should not be a big hindrance on a modern computer. Reading 200 MiB with fread into an allocated buffer might, however, be prohibitive (that depends on your target architecture and what else your program is doing). But don't worry, you have already had the best solution to the problem: memory mapping.
While memory mapping is not a "magic accelerator", it is nevertheless as close to "magic" as you can get.

The big advantage of memory mapping is that you can directly read from the buffer cache. This means that the OS will prefetch pages, and you can even ask it to prefetch more aggressively, so effectively all your reads will be "instantaneous". Also, what is stored in the buffer cache is in some sense "free".
Unluckily, memory mapping is not always easy to get right (especially since the documentation and the hint flags typically supplied by operating systems are deceptive or counter-productive).

While you have no guarantee that what has been read once stays in the buffers, in practice this is the case for anything of "reasonable" size. Of course the operating system cannot and will not keep a terabyte of data in RAM, but something around 200 MiB will quite reliably stay in the buffers on a "normal" modern computer. Reading from buffers works more or less in zero time.
So, your goal is to get the operating system to read the file into its buffers, as sequentially as possible. Unless the machine runs out of physical memory and is forced to discard buffer pages, this will be lightning fast (and if that happens, every other solution will be equally slow).

Linux has the readahead syscall, which lets you prefetch data. Unluckily, it blocks until the data has been fetched, which is probably not what you want (you would thus have to use an extra thread for this). madvise(MADV_WILLNEED) is a less reliable, but probably better, alternative. posix_fadvise may work too, but note that Linux limits the readahead to twice the default readahead size (i.e. 256 kiB).
Do not let yourself be fooled by the docs, as the docs are deceptive. It may seem that MADV_RANDOM is a better choice, as your access is "random". It makes sense to be honest with the OS about what you're doing, doesn't it? Usually yes, but not here. This simply turns off prefetching, which is the exact opposite of what you really want. I don't know the rationale behind this, maybe some ill-advised attempt to conserve memory; in any case it is detrimental to your performance.

Windows (since Windows 8, for desktop only) has PrefetchVirtualMemory, which does exactly what one would want here, but unluckily it's only available on the newest versions. On older versions, there is just... nothing.

A very easy, efficient, and portable way of populating the pages in your mapping is to launch a worker thread that faults in every page. This sounds horrendous, but it works very nicely and is operating-system agnostic.
Something like volatile int x = 0; for(int i = 0; i < len; i += 4096) x += map[i]; is entirely sufficient. I am using such code to pre-fault pages prior to accessing them; it works at speeds unrivalled by any other method of populating buffers, and uses very little CPU.

(moved to an answer as requested by the OP)

You cannot read from a file any quicker (there is no magic flag to say "read faster"). Either there is an issue with your hardware, or 200 ms is how long it is supposed to take.

1) The difference in access speed between your first read and subsequent ones is perfectly understandable: your first call actually reads the file from the disk, and this takes time. However, your kernel (not to mention the disk controller) keeps the accessed data buffered, so when you access it a second time it is a pure memory access (1 ms). Even if you only need to access really tiny portions of the file, libc/kernel/controller optimizations access the disk in quite large chunks. You can read the libc/OS/controller docs to try and align your reads on these chunks.

2) You're using stream input; try using the direct open/read/close functions instead: low-level I/O has less overhead (obviously). Nothing gets faster than this, so if you still find it too slow, you have an OS or hardware issue.

As it looks like you have a good benchmark, try switching the size and the count in your fread call: reading 1000 bytes in a single call will be faster than 1000 calls of 1 byte each.

Disk is slow, and as you pointed out, the delay comes from the first access: that's the disk spinning up and accessing the necessary sectors. You're always going to pay that cost once.

You could improve your performance a little by using memory-mapped IO. See either mmap (Linux) or CreateFileMapping + MapViewOfFile (Windows).

I have already optimized the file so that as few sections as possible have to be read

Correct me if I'm wrong, but in reference to the file being optimized, I'm assuming you mean you've ordered the sections to minimize the number of reads that take place, and not what I'm going to suggest.

Being IO-bound here is likely due to the seek times, so other than getting a faster storage medium, your options are limited.

Two possible ideas I had are:

1) Compress the data that is stored, which may give you slightly faster read times, but will still not help with seek time. You'd have to test whether this helps at all.

2) If relevant, as soon as you've retrieved one block of data, move it to a thread and start processing it while another read takes place. You may be doing this already, but if not, I thought it worth mentioning.
