
Optimizing generators for better execution time

I have the code below, where I am splitting a big text file into smaller ones, using generators to iterate over the file and then process it. It is far more memory efficient than a list-based version I wrote, but it suffers badly in terms of execution speed. Below is my code; I have figured out why it takes more time, but I cannot find a way to optimize it.

def main():
    # file_name = input("Enter the full path of file you want to split into smaller inputFiles: ")
    file_name = "/rbhanot/Downloads/newtest.txt"
    input_file = open(file_name)
    num_lines_orig = sum(1 for _ in input_file)
    input_file.seek(0)
    # parts = int(input("Enter the number of parts you want to split in: "))
    parts = 3
    output_files = ((file_name + str(i)) for i in range(1, parts + 1))
    st = 0
    p = int(num_lines_orig / parts)
    ed = p
    for i in range(parts - 1):
        file = next(output_files)
        with open(file, "w") as OF:
            for _ in range(st, ed):
                OF.writelines(input_file.readline())

            st = ed
            ed = st + p
            if num_lines_orig - ed < p:
                ed = st + (num_lines_orig - ed) + p
            else:
                ed = st + p

    file = next(output_files)
    with open(file, "w") as OF:
        for _ in range(st, ed):
            OF.writelines(input_file.readline())


if __name__ == "__main__":
    main()

The part that takes most of the time is below, where it loops over the range and makes two function calls per line, one for reading and one for writing.

    for _ in range(st, ed):
        OF.writelines(input_file.readline())

Here is another version of the same code using lists, and apparently this works much faster:

def main():
    # file_name = input("Enter the full path of file you want to split into smaller inputFiles: ")
    file_name = "/rbhanot/Downloads/newtest.txt"
    input_file = open(file_name).readlines()
    num_lines_orig = len(input_file)
    # parts = int(input("Enter the number of parts you want to split in: "))
    parts = 3
    output_files = [(file_name + str(i)) for i in range(1, parts + 1)]
    st = 0
    p = int(num_lines_orig / parts)
    ed = p
    for i in range(parts - 1):
        with open(output_files[i], "w") as OF:
            OF.writelines(input_file[st:ed])
        st = ed
        ed = st + p

    with open(output_files[-1], "w") as OF:
        OF.writelines(input_file[st:])


if __name__ == "__main__":
    main()

I know I can improve the execution speed by some fraction if I make this code multithreaded, since most of the work here is I/O, but I want to know if there is any other way to achieve the same without threading the code.

Thanks.

Your biggest bottleneck is file I/O. Reading from and writing to disk is slow.

You are, however, making matters a little worse by passing single lines to the file.writelines() method. The latter expects an iterable of lines (the implementation effectively just iterates and calls file.write() for each element). Since a string is itself an iterable, yielding its individual characters, you are in effect writing single characters to the file buffer. Compared to file I/O that's not that slow, but it is not efficient either. Don't use file.writelines() to write a single line; just use file.write().
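To see the difference, here is a small sketch using io.StringIO as a stand-in for a real file: passing a lone string to writelines() iterates over its characters, while write() hands over the whole string in one call.

```python
import io

line = "hello world\n"

# writelines() expects an iterable of strings; a lone string *is* an
# iterable -- of its characters -- so this issues one buffered write
# per character.
buf = io.StringIO()
buf.writelines(line)

# write() takes the whole line in a single call.
buf2 = io.StringIO()
buf2.write(line)

# The end result is identical; only the per-character overhead differs.
assert buf.getvalue() == buf2.getvalue()
```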

Next, you are using repeated file.readline() calls. Don't make a method call for each line; use the file object as an iterator instead, and take a run of lines from it with itertools.islice() to limit how many lines are written. If you pass the islice() object to file.writelines(), that method does the iteration for you:

with open(file, "w") as OF:
    OF.writelines(islice(input_file, p))

The above writes p lines to the OF file object. Note that we don't need to track start and end line numbers at all here! If you need to tack the 'remaining' lines of the file onto the end, you only need to read the remainder of the input file and copy whatever is there to the last output file. You can vastly simplify the code by just looping parts times and creating the file name inside the loop:

from itertools import islice
from shutil import copyfileobj

parts = int(input("Enter the number of parts you want to split in: "))

file_name = "/rbhanot/Downloads/newtest.txt"
with open(file_name) as input_file:
    num_lines_orig = sum(1 for _ in input_file)
    input_file.seek(0)

    chunk_size = num_lines_orig // parts

    for i in range(parts):
        output_file = f'{file_name}{i + 1}'
        with open(output_file, "w") as OF:
            OF.writelines(islice(input_file, chunk_size))

            if i == parts - 1:   # last iteration
                # copy across any remaining lines while OF is still open
                copyfileobj(input_file, OF)

I used the shutil.copyfileobj() function to handle the remainder copying; it'll read and write file data in blocks.
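The chunk-then-remainder pattern can be sketched in isolation, using io.StringIO objects in place of real files (the line contents are made up for illustration):

```python
import io
from itertools import islice
from shutil import copyfileobj

# A source "file" with 7 lines.
src = io.StringIO("".join(f"line {n}\n" for n in range(7)))

# Take the first 3 lines via the file-object iterator.
first = io.StringIO()
first.writelines(islice(src, 3))

# Copy whatever remains, in blocks, from the current position.
rest = io.StringIO()
copyfileobj(src, rest)
```

Because islice() advances the shared file position, copyfileobj() picks up exactly where the last chunk stopped: `first` holds lines 0-2 and `rest` holds lines 3-6.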
