
Read specific lines of csv file

Hello guys, so I have a huge CSV file (500K lines), and I want to process the file simultaneously with 4 processes (so each one will read approx. 100K lines). What is the best way to do this using multiprocessing?

What I have up till now:

from csv import DictReader
import multiprocessing

def csv_handler(path, processes=5):
    # Single reader: load every row into memory first.
    test_arr = []
    with open(path) as fd:
        reader = DictReader(fd)
        for row in reader:
            test_arr.append(row)

    # Integer division so the slice bounds are ints
    # (any remainder rows are dropped here).
    current_line = 0
    equal_length = len(test_arr) // processes

    workers = []
    for i in range(processes):
        # get_data is the worker function, defined elsewhere.
        p = multiprocessing.Process(
            target=get_data,
            args=(test_arr[current_line:current_line + equal_length],))
        p.start()
        workers.append(p)
        current_line += equal_length

    for p in workers:
        p.join()

I know it's a bad idea to do that with a single reading pass, but I can't find another option.. I'd be happy to get some ideas on how to do it in a better way!

CSV is a pretty tricky format to split the reads up with, and other file formats may be better suited.

The basic problem is that, since lines may have different lengths, you can't easily know where a particular line starts in order to fseek to it. You would have to scan through the file counting newlines, which basically amounts to reading it.

But you can get pretty close, which sounds like it is enough for your needs. Say, for two parts: take the file size and divide it by 2.

  • For the first part, you start at zero and stop after completing the record at file_size / 2.
  • For the second part, you seek to file_size / 2, look for the next newline, and start there.

This way, while the Python processes won't all get exactly the same amount, it will be pretty close, and it avoids too much inter-process message passing or multi-threading, and with CPython probably contention on the global interpreter lock.
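
Here is a minimal sketch of that byte-offset splitting idea, generalised to N parts. The names find_chunk_boundaries and process_chunk and the file name huge.csv are placeholders, and it assumes a plain CSV with no quoted fields containing embedded newlines (otherwise a newline is not a safe record boundary) and no header row to skip:

import multiprocessing
import os

def find_chunk_boundaries(path, n_parts):
    # Compute byte offsets that split the file into n_parts pieces of
    # roughly equal size, each starting right after a newline.
    file_size = os.path.getsize(path)
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, n_parts):
            f.seek(i * file_size // n_parts)  # jump near the i-th split point
            f.readline()                      # skip forward to the next newline
            boundaries.append(f.tell())
    boundaries.append(file_size)
    return boundaries

def process_chunk(path, start, end):
    # Worker: each process opens the file itself and reads only the
    # byte range [start, end), one line at a time.
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            fields = line.decode().rstrip("\r\n").split(",")  # naive CSV split
            # ... do the real per-row work on `fields` here ...

if __name__ == "__main__":
    bounds = find_chunk_boundaries("huge.csv", 4)
    workers = [
        multiprocessing.Process(
            target=process_chunk, args=("huge.csv", bounds[i], bounds[i + 1]))
        for i in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

Each worker opens the file independently, so no file handle is shared between processes, and the only data passed to each process is a pair of byte offsets.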


Of course, all the normal things for optimising either file IO or Python code still apply (depending on where your bottleneck lies; you need to measure this).
