
Reduce the number of disk accesses while writing to a file in C

I am writing a multi-threaded application, and this is my current idea. I have a FILE*[n], where n is a number determined at runtime. I open all n files for reading, and multiple threads can then read them. The computation on the data of each file is equivalent, i.e. under serial execution each file would stay in memory for the same amount of time.

Each file can be arbitrarily large, so one should not assume that the files can be loaded into memory.

In this scenario I want to reduce the number of disk I/O operations that occur. It would be great if someone could suggest a shared memory model for it (I don't know whether I am already using one, because I have very little idea of how these things are implemented). I am not sure how I should achieve this. In other words, I just want to know the most efficient model for implementing such a scenario. I am using C.

EDIT: A more detailed scenario.

The actual problem is that I have n bloom filters for the data contained in n files, and once all the elements from a file have been inserted into the corresponding bloom filter I need to do membership testing. Since membership testing is a read-only process on the data files, I can read a file from multiple threads, so the problem is easy to parallelize. The number of data files is fairly large (around 20k; note that the number of files equals the number of bloom filters), so I chose to spawn one thread per bloom filter: each bloom filter has its own thread, which reads every other file one by one and tests the membership of its data against that bloom filter. I want to minimize disk I/O in such a case.

At the start, use the mmap() function to map the files into memory instead of opening and reading FILE*s. After that, spawn the threads that read the files. That way the OS buffers the accesses in memory (the page cache) and goes to disk only for pages that are not already cached.
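To make the suggestion above concrete, here is a minimal sketch, assuming a POSIX system with pthreads. The names map_file, unmap_file, worker, scan_with_threads and MAX_THREADS are made up for illustration, and summing bytes is only a stand-in for the real per-thread work (such as a bloom filter lookup).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_THREADS 64          /* arbitrary limit for this sketch */

/* Hypothetical helper type: a read-only view of one mapped file. */
struct mapped_file {
    const unsigned char *data;
    size_t size;
};

/* Map `path` read-only. Returns 0 on success, -1 on failure. */
static int map_file(const char *path, struct mapped_file *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);              /* mmap() rejects zero-length mappings */
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping remains valid after close() */
    if (p == MAP_FAILED)
        return -1;

    out->data = p;
    out->size = (size_t)st.st_size;
    return 0;
}

static void unmap_file(struct mapped_file *mf)
{
    munmap((void *)mf->data, mf->size);
    mf->data = NULL;
    mf->size = 0;
}

struct worker_arg {
    const struct mapped_file *mf;   /* shared by all threads, read-only */
    unsigned long sum;              /* per-thread result */
};

/* Placeholder work: sum every byte of the shared mapping. */
static void *worker(void *p)
{
    struct worker_arg *arg = p;
    unsigned long sum = 0;
    for (size_t i = 0; i < arg->mf->size; i++)
        sum += arg->mf->data[i];
    arg->sum = sum;
    return NULL;
}

/* Run `nthreads` workers over the same mapping and store each result
   in sums[i]. Returns 0 on success, -1 if a thread could not start. */
static int scan_with_threads(const struct mapped_file *mf, int nthreads,
                             unsigned long *sums)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tids[MAX_THREADS];
    struct worker_arg args[MAX_THREADS];
    int started = 0, rc = 0;

    for (; started < nthreads; started++) {
        args[started].mf = mf;
        args[started].sum = 0;
        if (pthread_create(&tids[started], NULL, worker, &args[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        sums[i] = args[i].sum;
    }
    return rc;
}
```

A caller would run map_file() once per file, pass the result to scan_with_threads(), and call unmap_file() when all threads are done. Because the mapping is read-only, the threads need no locking to read it. Older toolchains may need -pthread when compiling.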

If your program is multi-threaded, all the threads share memory unless you take steps to create thread-local storage. You don't need OS shared memory directly. The way to minimize I/O is to ensure that each file is read only once if at all possible, and similarly that each result file is written only once.
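One way to get the "read each file only once" property is to read a file a single time and hand every record to all the interested consumers, instead of having each consumer reread the file. The following is only a sketch under assumptions: the record format (one record per line), the consumer interface, and the names scan_file_once, consumer, match_counter and count_matches are all hypothetical, and the exact-match counter merely stands in for a real test such as a bloom filter lookup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical consumer interface: one consumer per filter or analysis. */
typedef void (*record_fn)(void *state, const char *record, size_t len);

struct consumer {
    record_fn fn;
    void *state;
};

/* Read `fp` once, line by line, and pass each line to every consumer.
   Returns the number of records read. Lines longer than the buffer are
   delivered in several pieces, which a real program would need to handle. */
static size_t scan_file_once(FILE *fp, const struct consumer *consumers,
                             size_t n)
{
    char line[4096];
    size_t records = 0;

    assert(fp != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        size_t len = strcspn(line, "\n");   /* length without the newline */
        line[len] = '\0';
        for (size_t i = 0; i < n; i++)
            consumers[i].fn(consumers[i].state, line, len);
        records++;
    }
    return records;
}

/* Example consumer: counts the records equal to a fixed key. */
struct match_counter {
    const char *key;
    size_t hits;
};

static void count_matches(void *state, const char *record, size_t len)
{
    struct match_counter *mc = state;
    if (strlen(mc->key) == len && memcmp(mc->key, record, len) == 0)
        mc->hits++;
}
```

With this shape, the disk is touched once per file no matter how many consumers there are. Parallelism can then come from giving different files to different threads, as long as each consumer's state is updated safely.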

How you do that depends on the processing you're doing.

If each thread is responsible for processing a file in its entirety, then the thread simply reads the file; you can't reduce the I/O any more than that. If a file must be read by several threads, then you should try to memory map the file so that it is available to all the relevant threads. If you're using a 32-bit program and the files are too big to all fit in memory, you can't necessarily do the memory mapping. Then you need to work out how the different threads will process each file, and try to minimize the number of times different threads have to reread the files. If you're using a 64-bit program, you may have enough virtual memory to handle all the files via memory-mapped I/O. You still want to keep the number of times that the data is accessed to a minimum. Similar concepts apply to the output files.
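For the case where a file is too large to map in one piece, one possible compromise (not something the answer above prescribes) is to map it one window at a time. This is a rough sketch assuming POSIX mmap(); scan_in_windows, window_fn and sum_bytes are hypothetical names, and the byte sum is placeholder work.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback invoked once per mapped window. */
typedef void (*window_fn)(void *state, const unsigned char *data, size_t len);

/* Walk `path` through a sliding read-only mapping of `window_pages` pages.
   Returns 0 on success, -1 on failure. On a 32-bit build, files over 2 GiB
   would also need large-file support (e.g. -D_FILE_OFFSET_BITS=64). */
static int scan_in_windows(const char *path, size_t window_pages,
                           window_fn fn, void *state)
{
    assert(window_pages > 0 && fn != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    /* mmap() offsets must be page-aligned, so the window is a whole
       number of pages. */
    size_t window = window_pages * (size_t)page;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    int rc = 0;
    for (off_t off = 0; off < st.st_size; off += (off_t)window) {
        off_t left = st.st_size - off;
        size_t len = left < (off_t)window ? (size_t)left : window;

        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) {
            rc = -1;
            break;
        }
        fn(state, p, len);      /* process this window */
        munmap(p, len);         /* release address space before the next */
    }
    close(fd);
    return rc;
}

/* Example callback: accumulate the sum of all bytes seen. */
static void sum_bytes(void *state, const unsigned char *data, size_t len)
{
    unsigned long *sum = state;
    for (size_t i = 0; i < len; i++)
        *sum += data[i];
}
```

Note that a record may straddle a window boundary, so real code would have to carry the partial record over to the next window or choose boundaries that align with the record format.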
