
Alternatives for reducing the time needed to read a large number of binary files from hard disk

Original post: 2012-12-17 18:52:25. Tags: c, database-design

In the first prototype of my application, I have to read around 400,000 files sequentially from the hard disk (4 KB each, about 1.5 GB in total), perform some operation on the data read from each file, and store the results in RAM. With this approach I first did the I/O for one file, then used the CPU to process it, then moved on to the next file, and the whole process was very slow.

To work around this, we now first read all the files, store all their data in RAM, and then do the processing (utilizing the CPU). This gave a significant improvement.

But in the second phase of development I have to read 20 GB of data, which I cannot hold in RAM. And reading and then processing one file at a time is very time-consuming.

Can someone please suggest a way to work around this problem?

I am developing this application in C on Windows, with the Visual Studio compiler.

3 Answers

There's a technique called asynchronous I/O (AIO) that lets the CPU keep processing while a file is read in the background. You can use it to read the next few files while you are still processing the current one.

The AIO calls are OS-specific. On Windows, Microsoft calls it "Overlapped I/O". See the Wikipedia and MSDN pages on the topic for more info.

"To work around this, we now first read all the files, store all their data in RAM, and then do the processing (utilizing the CPU)."

(Assuming files can be processed independently...)

You are half-way there. Instead of waiting until all files have been loaded into RAM, start processing as soon as any file is loaded. That would be a form of pipelining.

You'll need three components:

A thread¹ that reads the files (the "producer").
A thread² that processes the files (the "consumer").
A message queue³ between them.

The producer reads the files the way you are already doing it, but instead of processing them, it just enqueues them on the message queue. The consumer thread waits until it can dequeue a file from the queue, processes it, immediately frees the memory the file occupied, and goes back to waiting on the queue.
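The producer/consumer arrangement described above could look roughly like the following bounded in-memory queue. This is a sketch, not a tuned implementation: it assumes a C11 toolchain that ships <threads.h> (recent glibc, or a recent Visual Studio), and on older Windows toolchains the same shape can be built from CRITICAL_SECTION and CONDITION_VARIABLE. Queue, QUEUE_CAP, run_pipeline and the fake 4 KB "files" are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <threads.h>

#define QUEUE_CAP 64  /* assumption: at most 64 loaded files wait in RAM */

typedef struct {
    void  *items[QUEUE_CAP];
    size_t head, count;
    int    closed;
    mtx_t  lock;
    cnd_t  not_empty, not_full;
} Queue;

static void queue_init(Queue *q)
{
    q->head = q->count = 0;
    q->closed = 0;
    mtx_init(&q->lock, mtx_plain);
    cnd_init(&q->not_empty);
    cnd_init(&q->not_full);
}

/* Producer side: blocks while the queue is full, which caps memory use. */
static void queue_push(Queue *q, void *item)
{
    assert(item != NULL);  /* NULL is reserved as the end-of-stream marker */
    mtx_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        cnd_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAP] = item;
    q->count++;
    cnd_signal(&q->not_empty);
    mtx_unlock(&q->lock);
}

/* Producer calls this once after the last file has been enqueued. */
static void queue_close(Queue *q)
{
    mtx_lock(&q->lock);
    q->closed = 1;
    cnd_broadcast(&q->not_empty);
    mtx_unlock(&q->lock);
}

/* Consumer side: blocks while empty; returns NULL once closed and drained. */
static void *queue_pop(Queue *q)
{
    void *item = NULL;
    mtx_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        cnd_wait(&q->not_empty, &q->lock);
    if (q->count > 0) {
        item = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        cnd_signal(&q->not_full);
    }
    mtx_unlock(&q->lock);
    return item;
}

typedef struct { Queue *q; unsigned long total; } Consumer;

static int consumer_main(void *arg)
{
    Consumer *c = arg;
    unsigned char *buf;
    while ((buf = queue_pop(c->q)) != NULL) {
        c->total += buf[0];  /* placeholder for the real processing */
        free(buf);           /* release the file's memory right away */
    }
    return 0;
}

/* The calling thread acts as producer; one extra thread consumes. */
static unsigned long run_pipeline(int n_files)
{
    Queue q;
    Consumer c;
    thrd_t t;
    int i;
    queue_init(&q);
    c.q = &q;
    c.total = 0;
    if (thrd_create(&t, consumer_main, &c) == thrd_success) {
        for (i = 0; i < n_files; i++) {
            unsigned char *buf = malloc(4096);  /* stands in for reading file i */
            if (buf == NULL)
                break;
            buf[0] = 1;
            queue_push(&q, buf);
        }
        queue_close(&q);
        thrd_join(t, NULL);
    }
    cnd_destroy(&q.not_full);
    cnd_destroy(&q.not_empty);
    mtx_destroy(&q.lock);
    return c.total;
}
```

Because queue_push blocks when the queue is full, the producer can never run more than QUEUE_CAP files ahead of the consumer, so peak memory stays bounded even for the 20 GB data set.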

If you can process files by traversing them sequentially from start to finish, you could even devise a more fine-grained "streaming" scheme, where files would be both read and processed in chunks. That could lower the peak memory consumption even more (e.g. if you have some extra-large files, they would no longer need to be kept whole in memory).
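A minimal sketch of such chunked processing in standard C follows. It assumes the operation can be expressed incrementally, which is modelled here by a hypothetical process_chunk that keeps a running byte sum; CHUNK_SIZE and both function names are illustrative only.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE 65536  /* assumption: 64 KB chunks; tune for the workload */

/* Hypothetical incremental operation: adds the chunk's bytes to *state. */
static void process_chunk(unsigned long *state,
                          const unsigned char *chunk, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        *state += chunk[i];
}

/* Stream one file through process_chunk. Returns 0 on success, -1 on error.
   Only CHUNK_SIZE bytes of the file are held in memory at any time. */
static int process_file_streaming(const char *path, unsigned long *state)
{
    FILE *f;
    unsigned char *chunk;
    size_t n;
    int rc;

    assert(path != NULL && state != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    chunk = malloc(CHUNK_SIZE);
    if (chunk == NULL) {
        fclose(f);
        return -1;
    }
    while ((n = fread(chunk, 1, CHUNK_SIZE, f)) > 0)
        process_chunk(state, chunk, n);
    rc = ferror(f) ? -1 : 0;
    free(chunk);
    fclose(f);
    return rc;
}
```

For the 4 KB files in the question a single chunk already covers a whole file, so this only pays off when some inputs are much larger than the chunk size.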

¹ Or a set of threads to parallelize the I/O, if you anticipate reading from multiple physical disks.

² Or a set of threads to saturate the CPU cores, if processing a file is not cheaper than reading it.

³ You don't need a fancy persistent distributed message queue for that. Just a plain in-memory queue, à la BlockingCollection in .NET (I'm sure you'll find something similar for pure C).

Create threads (in a loop) that read files into RAM.
Work with the data in RAM in separate thread(s), and free the RAM after processing.
Keep the limits and a pool of records about the files (read and processed) in a shared object protected by a mutex.
Use a semaphore to synchronise production and consumption of the resources (files in RAM).