
Using buffered reader for large .csv files, Python

I'm trying to open large .csv files (16k+ lines, ~15 columns) in a Python script, and am having some issues.

I use the built-in open() function to open the file, then declare a csv.DictReader using the input file. The loop is structured like this:

for (i, row) in enumerate(reader):
    # do stuff (send serial packet, read response)
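
Put together, the setup described above amounts to roughly the following (a minimal sketch; "input.csv" is a placeholder file name, not from the original script):

import csv

infile = open("input.csv")          # placeholder file name
reader = csv.DictReader(infile)
for (i, row) in enumerate(reader):
    # do stuff (send serial packet, read response)
    pass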

However, if I use a file longer than about 20 lines, the file will open, but within a few iterations I get a ValueError: I/O operation on a closed file.

My thought is that I might be running out of memory (though the 16k-line file is only 8MB, and I have 3GB of RAM), in which case I expect I'll need to use some sort of buffer to load only sections of the file into memory at a time.
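
For reference, the chunked-reading idea mentioned here could be sketched with itertools.islice (the chunk size and file name below are arbitrary illustrations, not from the original script):

import csv
from itertools import islice

CHUNK_SIZE = 1000                                  # arbitrary number of rows per chunk

with open("input.csv") as f:                       # placeholder file name
    reader = csv.DictReader(f)
    while True:
        chunk = list(islice(reader, CHUNK_SIZE))   # read at most CHUNK_SIZE rows
        if not chunk:
            break
        for row in chunk:
            pass                                   # process each row here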

Am I on the right track? Or could there be other causes for the file closing unexpectedly?

Edit: about half the times I run this with a CSV of 11 lines, it gives me the ValueError. The error does not always happen at the same line.

16k lines is nothing for 3GB of RAM; most probably your problem is something else, e.g. you are taking too much time in some other process which interferes with the opened file. Just to be sure, and anyway for speed since you have 3GB of RAM, load the whole file into memory and then parse it, e.g.:

import csv
import cStringIO  # Python 2 module; io.StringIO is the Python 3 equivalent

data = open("/tmp/1.csv").read()                    # read the whole file into memory
reader = csv.DictReader(cStringIO.StringIO(data))   # parse the in-memory string
for row in reader:
    print row

With this, at least, you shouldn't get the file-open error.
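
For Python 3, a roughly equivalent version would use io.StringIO in place of cStringIO (a sketch of the same idea; only the module name and print syntax change):

import csv
import io

data = open("/tmp/1.csv").read()             # read the whole file into memory
reader = csv.DictReader(io.StringIO(data))   # parse the in-memory string
for row in reader:
    print(row)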

csv_reader is faster. Read the whole file in blocks. To avoid a memory leak, it is better to use a subprocess:

from multiprocessing import Process

def child_process(name):
    # Do the read and process stuff here.
    pass

if __name__ == '__main__':
    # Get file object resource.
    .....
    p = Process(target=child_process, args=(resource,))
    p.start()
    p.join()

For more information, please go through this link: http://articlesdictionary.wordpress.com/2013/09/29/read-csv-file-in-python/
