
Python: performance issues with islice

With the following code, I'm seeing longer and longer execution times as I increase the starting row in islice. For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s. Why does this happen, and is there a faster way to do this? I want to be able to iterate over several ranges of rows in a large CSV file (several GB) and make some calculations.

import csv
import itertools
from collections import deque
import time

my_queue = deque()

start_row = 500004
stop_row = start_row + 50000

with open('test.csv', 'rb') as fin:
    #load into csv's reader
    csv_f = csv.reader(fin)

    #start logging time for performance
    start = time.time()

    for row in itertools.islice(csv_f, start_row, stop_row):
        my_queue.append(float(row[4])*float(row[10]))

    #stop logging time
    end = time.time()
    #display performance
    print "Initial queue populating time: %.2f" % (end-start)

For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s

That is islice being intelligent. Or lazy, depending on which term you prefer.

Thing is, files are "just" strings of bytes on your hard drive. They don't have any internal organization. \n is just another set of bytes in that long, long string. There is no way to access any particular line without looking at all of the information before it (unless your lines are of the exact same length, in which case you can use file.seek).
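
As a quick illustration that seek() addresses bytes rather than lines (using the question's test.csv), jumping to an arbitrary offset will usually land you in the middle of a row:

with open('test.csv', 'rb') as f:
    f.seek(100)        # byte 100, wherever that happens to fall
    print(f.read(40))  # most likely starts mid-line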

Line 4? Finding line 4 is fast: your computer just needs to find 3 \n. Line 500004? Your computer has to read through the file until it finds 500003 \n. No way around it, and if someone tells you otherwise, they either have some sort of quantum computer or their computer is reading through the file just like every other computer in the world, just behind their back.

As for what you can do about it: try to be smart when grabbing lines to iterate over. Smart, and lazy. Arrange your requests so you're only iterating through the file once, and close the file as soon as you've pulled the data you need. (islice does all of this, by the way.)

In Python:

lines_I_want = [(4, 50004), (500004, 550004)]  # sorted, non-overlapping (start, stop) ranges from the question

with open('test.csv') as f:
    for i, line in enumerate(f):
        if i >= lines_I_want[0][0]:
            if i >= lines_I_want[0][1]:
                lines_I_want.pop(0)   # finished the current range
                if not lines_I_want:  # list is empty: nothing left to read
                    break
            else:
                # line is one I want. Do something with it
                pass

And if you have any control over making that file, make every line the same length so you can seek. Or use a database.
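
As an illustration of the fixed-width idea, here is a minimal sketch assuming a hypothetical file fixed_width.csv whose lines are all exactly RECORD_LEN bytes, trailing newline included:

RECORD_LEN = 32  # assumed fixed line length in bytes, newline included

def read_line(f, n):
    """ Jump straight to line n of a fixed-width-record file. """
    f.seek(n * RECORD_LEN)     # constant-time jump, no scanning
    return f.read(RECORD_LEN)

with open('fixed_width.csv', 'rb') as f:
    print(read_line(f, 500004))  # no need to read the 500004 lines before it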

The problem with using islice() for what you're doing is that it iterates through all the lines before the first one you want before returning anything. Obviously the larger the starting row, the longer this will take. Another is that you're using a csv.reader to read these lines, which likely incurs unnecessary overhead, since one line of the csv file is often one row of it. The only time that's not true is when the csv file has string fields in it that contain embedded newline characters, which in my experience is uncommon.
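
A small sketch to make that concrete: islice has to pull (and throw away) every skipped item from the underlying iterator, so the skipping itself is linear in the start position.

from itertools import islice

it = iter(range(10))
print(list(islice(it, 5, 8)))  # [5, 6, 7], but items 0-4 were read and discarded
print(next(it))                # 8: islice advanced the underlying iterator past 7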

If this is a valid assumption for your data, it would likely be much faster to first index the file and build a table of (filename, offset, number-of-rows) tuples indicating the approximately equally-sized logical chunks of lines/rows in the file. With that, you can process them relatively quickly by first seeking to the starting offset and then reading the specified number of csv rows from that point on.

Another advantage to this approach is that it would allow you to process the chunks in parallel, which I suspect is the real problem you're trying to solve, based on a previous question of yours. So, even though you haven't mentioned multiprocessing here, the following has been written to be compatible with doing that, if that's the case.

import csv
from itertools import islice
import os
import sys

def open_binary_mode(filename, mode='r'):
    """ Open a file proper way (depends on Python verion). """
    kwargs = (dict(mode=mode+'b') if sys.version_info[0] == 2 else
              dict(mode=mode, newline=''))
    return open(filename, **kwargs)

def split(infilename, num_chunks):
    """ Index the file into (filename, offset, number-of-rows) chunk descriptors. """
    infile_size = os.path.getsize(infilename)
    chunk_size = infile_size // num_chunks  # target size of each chunk in bytes
    offset = 0
    num_rows = 0
    bytes_read = 0
    chunks = []
    with open_binary_mode(infilename, 'r') as infile:
        for _ in range(num_chunks):
            # consume whole lines until this chunk reaches its target byte size
            while bytes_read < chunk_size:
                try:
                    bytes_read += len(next(infile))
                    num_rows += 1
                except StopIteration:  # end of infile
                    break
            chunks.append((infilename, offset, num_rows))
            offset += bytes_read  # the next chunk starts where this one ended
            num_rows = 0
            bytes_read = 0
    return chunks

chunks = split('sample_simple.csv', num_chunks=4)
for filename, offset, rows in chunks:
    print('processing: {} rows starting at offset {}'.format(rows, offset))
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            print(row)
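
Since parallel processing was part of the motivation, here is a minimal sketch of how the chunk table could feed a worker pool. It is Python 3 only (Pool.starmap and Pool as a context manager need 3.3+), and process_chunk is a hypothetical worker that mirrors the calculation from the question:

from multiprocessing import Pool

def process_chunk(filename, offset, rows):
    """ Hypothetical worker: sum column4 * column10 over one chunk. """
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        return sum(float(row[4]) * float(row[10])
                   for row in islice(csv.reader(fin), rows))

if __name__ == '__main__':
    chunks = split('sample_simple.csv', num_chunks=4)
    with Pool(processes=4) as pool:
        results = pool.starmap(process_chunk, chunks)  # one chunk per worker
    print(sum(results))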
