How should I optimize this filesystem I/O bound program?

I have a Python program that does something like this:

  1. Read a row from a csv file.
  2. Do some transformations on it.
  3. Break it up into the actual rows as they would be written to the database.
  4. Write those rows to individual csv files.
  5. Go back to step 1 unless the file has been totally read.
  6. Run SQL*Loader and load those files into the database.

Step 6 isn't really taking much time at all. It seems to be step 4 that's taking up most of the time. For the most part, I'd like to optimize this for handling a set of records in the low millions running on a quad-core server with a RAID setup of some kind.
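
For reference, the loop looks roughly like the sketch below; the transform, table names, and file layout are simplified placeholders:

    import csv

    def transform(row):
        # Placeholder for step 2; the real transformation is more involved.
        # Step 3: one input row may expand into several (table, row) pairs.
        return [("some_table", row)]

    writers = {}   # step 4: one output csv per destination table
    files = []

    with open("input.csv", newline="") as src:
        for row in csv.reader(src):                # step 1
            for table, out_row in transform(row):  # steps 2-3
                if table not in writers:
                    f = open(f"{table}.csv", "w", newline="")
                    files.append(f)
                    writers[table] = csv.writer(f)
                writers[table].writerow(out_row)   # step 4

    for f in files:                                # input fully read (step 5)
        f.close()
    # step 6: SQL*Loader then loads the generated csv files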

I have a few ideas for solving this:

  1. Read the entire file from step one (or at least read it in very large chunks) and write the file to disk as a whole or in very large chunks. The idea being that the hard disk would spend less time going back and forth between files. Would this do anything that buffering wouldn't?
  2. Parallelize step 1, steps 2 & 3, and step 4 into separate processes. This would make steps 1, 2, and 3 not have to wait on 4 to complete.
  3. Break the load file up into separate chunks and process them in parallel. The rows don't need to be handled in any sequential order. This would likely need to be combined with idea 2 somehow; a rough sketch of that combination follows this list.
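
A rough sketch of ideas 2 and 3 combined, assuming the input can simply be split into row chunks and that each worker process writes its own output file (all names here are placeholders):

    import csv
    from multiprocessing import Pool

    def transform(row):
        """Placeholder for steps 2-3; one input row may yield several output rows."""
        return [row]

    def munch_chunk(args):
        """Worker: transform one chunk of rows and write its own output file."""
        chunk_id, rows = args
        with open(f"out_chunk_{chunk_id}.csv", "w", newline="") as out:
            writer = csv.writer(out)
            for row in rows:
                writer.writerows(transform(row))
        return chunk_id

    def chunks(reader, size=50_000):
        """Yield (chunk_id, list_of_rows) pairs of roughly `size` rows each."""
        batch, chunk_id = [], 0
        for row in reader:
            batch.append(row)
            if len(batch) >= size:
                yield chunk_id, batch
                batch, chunk_id = [], chunk_id + 1
        if batch:
            yield chunk_id, batch

    if __name__ == "__main__":
        with open("input.csv", newline="") as src, Pool() as pool:
            # imap_unordered is fine because the rows need no sequential order
            for _ in pool.imap_unordered(munch_chunk, chunks(csv.reader(src))):
                pass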

Of course, the correct answer to this question is "do what you find to be the fastest by testing." However, I'm mainly trying to get an idea of where I should spend my time first. Does anyone with more experience in these matters have any advice?

Poor man's map-reduce:

Use split to break the file up into as many pieces as you have CPUs.

Use batch to run your muncher in parallel.

Use cat to concatenate the results.

If you are I/O bound, the best way I have found to optimize is to read or write the entire file into/out of memory at once, then operate out of RAM from there on.

With extensive testing I found that my runtime ended up bound not by the amount of data I read from/wrote to disk, but by the number of I/O operations I used to do it. That is what you need to optimize.

I don't know Python, but if there is a way to tell it to write the whole file out of RAM in one go, rather than issuing a separate I/O for each byte, that's what you need to do.
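
In Python that roughly amounts to accumulating each file's output in memory and issuing a single write per file, something like this sketch (the filenames and the emit helper are hypothetical):

    import io

    buffers = {}          # one in-memory buffer per output file

    def emit(filename, line):
        # Accumulate output in RAM instead of writing each row as it is produced.
        buffers.setdefault(filename, io.StringIO()).write(line + "\n")

    # ... steps 1-3 run here, calling emit(...) for every generated row ...

    for filename, buf in buffers.items():
        with open(filename, "w") as f:
            f.write(buf.getvalue())   # one large write per file, not one per row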

Of course the drawback to this is that files can be considerably larger than available RAM. There are lots of ways to deal with that, but that is another question for another time.

Python already does IO buffering, and the OS should handle both prefetching the input file and delaying writes until it needs the RAM for something else or just gets uneasy about having dirty data in RAM for too long. That is, unless you force the OS to write them immediately, for example by closing the file after each write or opening the file in O_SYNC mode.

If the OS isn't doing the right thing, you can try raising the buffer size (the third parameter to open()). For some guidance on appropriate values: given a 100 MB/s, 10 ms latency I/O system, a 1 MB I/O size will result in approximately 50% latency overhead, while a 10 MB I/O size will result in about 9% overhead. If it's still I/O bound, you probably just need more bandwidth. Use your OS-specific tools to check what kind of bandwidth you are getting to/from the disks.
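
For instance, a minimal sketch; the 10 MB figure just follows the latency estimate above, not a measured value:

    # buffering is the third parameter to open(); ask for a 10 MB buffer so each
    # output file only reaches the OS in large writes.
    out = open("rows_for_some_table.csv", "w", buffering=10 * 1024 * 1024)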

Also useful is to check whether step 4 is spending its time executing or waiting on I/O. If it's executing, you'll need to spend more time checking which part is the culprit and optimize that, or split the work across different processes.
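
A quick first check from inside the script is to compare CPU time against wall-clock time around step 4 (the OS tools mentioned above give the fuller picture):

    import time

    cpu0, wall0 = time.process_time(), time.perf_counter()
    # ... run step 4 (writing the per-table csv files) ...
    cpu1, wall1 = time.process_time(), time.perf_counter()

    # Wall time far above CPU time means the step is mostly waiting on I/O;
    # similar numbers mean it is mostly executing Python code.
    print(f"cpu={cpu1 - cpu0:.1f}s  wall={wall1 - wall0:.1f}s")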

Can you use a ramdisk for step 4? Low millions sounds doable if the rows are less than a couple of kB or so.

Use buffered writes for step 4.

Write a simple function that simply appends the output onto a string, checks the string length, and only writes when you have enough, which should be some multiple of 4k bytes. I would say start with 32k buffers and time it.

You would have one buffer per file, so that most "writes" won't actually hit the disk.
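
A minimal sketch of that per-file buffer, using a list join rather than repeated string concatenation, with the suggested 32k starting threshold:

    class BufferedFileWriter:
        """Accumulate output per file and only touch the disk every `threshold` bytes."""

        def __init__(self, path, threshold=32 * 1024):   # start with 32k and time it
            self.f = open(path, "w")
            self.parts = []
            self.size = 0
            self.threshold = threshold

        def write(self, text):
            self.parts.append(text)
            self.size += len(text)
            if self.size >= self.threshold:
                self.flush()

        def flush(self):
            self.f.write("".join(self.parts))   # one disk write per threshold's worth of rows
            self.parts, self.size = [], 0

        def close(self):
            self.flush()
            self.f.close()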

Isn't it possible to collect a few thousand rows in RAM, then go directly to the database server and execute them?

This would remove the save to disk and load from disk that step 4 entails.

If the database server is transactional, this is also a safe way to do it: just have the database begin a transaction before your first row and commit after the last.
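
A hedged sketch of that batching, assuming Oracle via cx_Oracle (since SQL*Loader is already in the picture); the table, columns, and generate_rows() helper are hypothetical:

    import cx_Oracle

    INSERT = "INSERT INTO target_table (a, b, c) VALUES (:1, :2, :3)"   # hypothetical table
    BATCH = 5000                                   # "a few thousand rows" per round trip

    conn = cx_Oracle.connect("user/password@dsn")  # placeholder credentials
    cur = conn.cursor()

    batch = []
    for out_row in generate_rows():                # hypothetical: yields the step-3 row tuples
        batch.append(out_row)
        if len(batch) >= BATCH:
            cur.executemany(INSERT, batch)
            batch.clear()
    if batch:
        cur.executemany(INSERT, batch)

    conn.commit()   # one transaction: begun before the first row, committed after the last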

The first thing is to be certain of what you should optimize. You don't seem to know precisely where your time is going. Before spending more time wondering, use a performance profiler to see exactly where the time is going.

http://docs.python.org/library/profile.html
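
For example (assuming the script's entry point is a main() function; the output filename is arbitrary):

    # Either run the whole script under the profiler:
    #     python -m cProfile -o load_profile.out your_script.py
    # or profile just the entry point from within the code:
    import cProfile
    import pstats

    cProfile.run("main()", "load_profile.out")
    pstats.Stats("load_profile.out").sort_stats("cumulative").print_stats(20)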

When you know exactly where the time is going, you'll be in a better position to know where to spend your time optimizing.
