
Python: performance issues with islice

With the following code, I'm seeing longer and longer execution times as I increase the starting row passed to islice. For example, a start_row of 4 executes in 1s, but a start_row of 500004 takes 11s. Why does this happen, and is there a faster way to do this? I want to be able to iterate over several ranges of rows in a large CSV file (several GB) and perform some calculations.

import csv
import itertools
from collections import deque
import time

my_queue = deque()

start_row = 500004
stop_row = start_row + 50000

with open('test.csv', 'rb') as fin:
    #load into csv's reader
    csv_f = csv.reader(fin)

    #start logging time for performance
    start = time.time()

    for row in itertools.islice(csv_f, start_row, stop_row):
        my_queue.append(float(row[4])*float(row[10]))

    #stop logging time
    end = time.time()
    #display performance
    print "Initial queue populating time: %.2f" % (end-start)

For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s

That is islice being intelligent. Or lazy, depending on which term you prefer.

The thing is, files are "just" strings of bytes on your hard drive. They don't have any internal line structure; \n is just another byte sequence in that long, long string. There is no way to reach a particular line without reading through all of the information before it (unless your lines are all of exactly the same length, in which case you can use file.seek).

Line 4? Finding line 4 is fast: your computer just needs to find 3 \n characters. Line 500004? Your computer has to read through the file until it finds 500003 \n characters. There is no way around it, and if someone tells you otherwise, either they have some exotic quantum computer or their computer is reading through the file just like every other computer in the world, just behind their back.
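To make that concrete, here is a rough sketch of what the skipping inside islice(reader, start, stop) boils down to: every skipped line is still pulled from the underlying iterator, one next() call at a time, which is where your 11 seconds go.

def skip_then_take(iterable, start, stop):
    """Roughly what islice(iterable, start, stop) amounts to.
    (A simplified sketch; the real islice also handles steps,
    open-ended stops, and exhausted iterators.)"""
    it = iter(iterable)
    for _ in range(start):
        next(it)              # each skipped line is still read from disk
    for _ in range(stop - start):
        yield next(it)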

As for what you can do about it: be smart about how you grab the lines you want to iterate over. Smart, and lazy. Arrange your requests so you only iterate through the file once, and close the file as soon as you've pulled the data you need. (islice does all of this, by the way.)

In Python (start1, stop1, and filename below are placeholders):

lines_I_want = [(start1, stop1), (start2, stop2), ...]  # ranges sorted in ascending order
with open(filename) as f:
    for i, j in enumerate(f):
        if i >= lines_I_want[0][0]:
            if i >= lines_I_want[0][1]:
                lines_I_want.pop(0)
                if not lines_I_want:  # list is empty, nothing left to read
                    break
            else:
                pass  # j is a line I want. Do something with it

And if you have any control over making that file, make every line the same length so you can use file.seek. Or use a database.
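For illustration, a minimal sketch of the fixed-length-record idea, assuming a hypothetical file whose lines are all padded to exactly RECORD_LEN bytes (newline included):

RECORD_LEN = 64  # hypothetical fixed line length, '\n' included

def read_line(filename, n):
    """Jump straight to 0-based line n without scanning the lines before it."""
    with open(filename, 'rb') as f:
        f.seek(n * RECORD_LEN)          # constant-time jump, no searching for '\n'
        return f.read(RECORD_LEN).rstrip(b'\n')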

The problem with using islice() for what you're doing is that it iterates through all the lines before the first one you want before returning anything. Obviously the larger the starting row, the longer that takes. Another issue is that you're using a csv.reader to read those lines, which likely incurs unnecessary overhead, since one line of the csv file usually corresponds to exactly one row of it. The only time that's not true is when the csv file has string fields that contain embedded newline characters, which in my experience is uncommon.

If this is a valid assumption for your data, it would likely be much faster to first index the file, building a table of (filename, offset, number-of-rows) tuples that describe approximately equal-sized logical chunks of lines/rows in the file. With that, you can process each chunk relatively quickly by first seeking to its starting offset and then reading the specified number of csv rows from that point on.

Another advantage of this approach is that it would allow you to process the chunks in parallel, which I suspect is the real problem you're trying to solve, based on a previous question of yours. So, even though you haven't mentioned multiprocessing here, the following has been written to be compatible with doing that, if that's the case.

import csv
from itertools import islice
import os
import sys

def open_binary_mode(filename, mode='r'):
    """ Open a file proper way (depends on Python verion). """
    kwargs = (dict(mode=mode+'b') if sys.version_info[0] == 2 else
              dict(mode=mode, newline=''))
    return open(filename, **kwargs)

def split(infilename, num_chunks):
    infile_size = os.path.getsize(infilename)
    chunk_size = infile_size // num_chunks
    offset = 0
    num_rows = 0
    bytes_read = 0
    chunks = []
    with open_binary_mode(infilename, 'r') as infile:
        for _ in range(num_chunks):
            while bytes_read < chunk_size:
                try:
                    bytes_read += len(next(infile))
                    num_rows += 1
                except StopIteration:  # end of infile
                    break
            chunks.append((infilename, offset, num_rows))
            offset += bytes_read
            num_rows = 0
            bytes_read = 0
    return chunks

chunks = split('sample_simple.csv', num_chunks=4)
for filename, offset, rows in chunks:
    print('processing: {} rows starting at offset {}'.format(rows, offset))
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            print(row)
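And since each chunk tuple carries everything a worker needs, handing the chunks to a pool is straightforward. A minimal sketch, reusing split() and open_binary_mode() from above (the process_chunk worker here is a hypothetical stand-in for whatever per-row calculation you actually need, using the row[4] * row[10] product from your question):

import multiprocessing

def process_chunk(chunk):
    """Hypothetical worker: sum row[4] * row[10] over one chunk's rows."""
    filename, offset, rows = chunk
    total = 0.0
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            total += float(row[4]) * float(row[10])
    return total

if __name__ == '__main__':
    chunks = split('sample_simple.csv', num_chunks=4)
    pool = multiprocessing.Pool()
    print(pool.map(process_chunk, chunks))  # one result per chunk
    pool.close()
    pool.join()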
