
Why can't multi-threading improve an mmap task?

I have a big task that needs to read 500 files (50 GB in total).

For every file, I read it out and do some calculation on the data from the file. Just calculation, nothing else. I can ensure the tasks are independent; they only share a singleton object for reading (I don't think that's the problem).

Currently, I use mmap to get a pointer to the start of the file's contents, then loop over the data to calculate.
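A minimal stand-in sketch of the per-file pattern (the byte sum here just replaces my real calculation, which I can't post):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map one file read-only and run a calculation over its bytes.
   The byte sum is a placeholder for the real (unshown) computation. */
static uint64_t process_file(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 0; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 0; }

    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                    /* the mapping stays valid after close */
    if (p == MAP_FAILED) { perror("mmap"); return 0; }

    uint64_t sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];              /* each new 4 KiB page faults in on first touch */

    munmap(p, st.st_size);
    return sum;
}
```

In the thread pool, each worker just calls `process_file` on a different path.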

Running the task in a single thread takes 30 s,

but running it in a thread pool takes 35 s (6 threads).

My machine has 16 GB of memory and a 2.2 GHz CPU with 8 threads.

I have tried a lot of settings and carefully ensured that the tasks are independent.

I'm not very good with hardware. Is there a hard limit on IO that caps my speed? Can anyone point me to something I can read?

Sorry, the code is too complex; I can't make a valid demo here.

You can try using the `MAP_POPULATE` flag on mmap to read ahead if you want to load the whole file, or use `madvise`.
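For example (a Linux-specific sketch; `MAP_POPULATE` is a Linux extension, and the `madvise` hints shown are an alternative if you skip it):

```c
#define _GNU_SOURCE           /* for MAP_POPULATE and MADV_* on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only, asking the kernel to pre-fault every
   page up front instead of faulting them in one by one during the
   compute loop. */
static void *map_whole_file(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return NULL; }

    /* MAP_POPULATE triggers read-ahead of the entire mapping now. */
    void *p = mmap(NULL, st.st_size, PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return NULL;

    /* Equivalent hints if you drop MAP_POPULATE (harmless here):   */
    madvise(p, st.st_size, MADV_SEQUENTIAL); /* aggressive read-ahead */
    madvise(p, st.st_size, MADV_WILLNEED);   /* start paging in now   */

    *len_out = (size_t)st.st_size;
    return p;
}
```

Remember to `munmap` the returned pointer with the reported length when done.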

The most important hardware detail is not mentioned: whether you read from an SSD or an HDD. I assume you use an SSD, otherwise the thread-pool code would be much, much slower.

I don't understand why you use mmap here. There are only three valid reasons to mmap a file. First, the data structure on disk is complex and you like to poke around in it, which is slow because it makes read-ahead much less efficient. Second, you need shared memory between processes. Third, you work on huge files and need the OS to swap data back out to the file when your system comes under memory pressure (all databases do it for this single reason).
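If none of those reasons apply, a plain `read()` loop is a reasonable alternative to try, since it lets kernel read-ahead run at full speed with no per-page faults (again, the byte sum is just a stand-in for the real calculation):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Stream a file through a 1 MiB buffer with plain read().
   The byte sum is a placeholder for the real computation. */
static uint64_t sum_file_read(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;

    enum { BUF = 1 << 20 };            /* 1 MiB chunks */
    unsigned char *buf = malloc(BUF);
    uint64_t sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, BUF)) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];             /* compute on the buffered chunk */

    free(buf);
    close(fd);
    return sum;
}
```

Benchmarking this against the mmap version on your actual data would show whether page-fault overhead is part of the problem.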
