
Reduce runtime, file reading, string manipulation of every line and file writing

I'm writing a script that reads all lines from multiple files, reads the number at the beginning of each block, and puts that number in front of every line of the block until the next number, and so on. Afterwards it writes all processed lines into a single .csv file.

The files I am reading look like this:

13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25

And the output file should look like this:

13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25

Currently my script is this:

from asyncio import Semaphore, ensure_future, gather, run
import time

limit = 8

async def read(file_list):
    tasks = list()
    result = None

    sem = Semaphore(limit)

    for file in file_list:
        task = ensure_future(read_bounded(file,sem))
        tasks.append(task)

        result = await gather(*tasks)

    return result

async def read_bounded(file,sem):
    async with sem:
        return await read_one(file)

async def read_one(filename):
    result = list()
    with open(filename) as file:
        dataList=[]
        content = file.read().split(":")
        file.close()
        j=1
        filmid=content[0]
        append=result.append
        while j<len(content):
            for entry in content[j].split("\n"):
                if len(entry)>10:
                    append("%s%s%s%s" % (filmid,",",entry,"\n"))
                else:
                    if len(entry)>0:
                        filmid=entry
            j+=1
    return result

if __name__ == '__main__':
    start=time.time()
    write_append="w"
    files = ['combined_data_1.txt', 'combined_data_2.txt', 'combined_data_3.txt', 'combined_data_4.txt']

    res = run(read(files))

    with open("output.csv",write_append) as outputFile:
        for result in res:
            outputFile.write(''.join(result))
            outputFile.flush()
    outputFile.close()
    end=time.time()
    print(end-start)

It has a runtime of about 135 seconds (the 4 input files are about 500 MB each and the output file is 2.3 GB). Running the script takes about 10 GB of RAM, which I think might be a problem. The biggest amount of time seems to be spent on creating the list of all lines. I would like to reduce the runtime of this program, but I am new to Python and not sure how to do it. Can you give me some advice?

Thanks

Edit:

I measured the times for the following commands in cmd (I only have Windows installed on my computer, so I used what I hope are equivalent cmd commands):

sequential writing to NUL

timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt  > NUL"

combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt

command took 0:1:25.87 (85.87s total)

sequential writing to file

timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt  > test.csv"

combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt

command took 0:2:42.93 (162.93s total)

parallel

timecmd "type combined_data_1.txt > NUL & type combined_data_2.txt > NUL & type combined_data_3.txt >NUL & type combined_data_4.txt  > NUL"
command took 0:1:25.51 (85.51s total)

In this case you're not gaining anything by using asyncio for two reasons:

  • asyncio is single-threaded and doesn't parallelize processing (and, in Python, neither can threads)
  • the IO calls access the file system, which asyncio doesn't cover - it is primarily about network IO

The giveaway that you're not using asyncio correctly is the fact that your read_one coroutine doesn't contain a single await. That means that it never suspends execution, and that it will run to completion before ever yielding to another coroutine. Making it an ordinary function (and dropping asyncio altogether) would have the exact same result.

Here is a rewritten version of the script with the following changes:

  • byte IO throughout, for efficiency
  • iterates through the file rather than loading all at once
  • sequential code
import sys

def process(in_filename, outfile):
    with open(in_filename, 'rb') as r:
        for line in r:
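            # a line like b'13368:\n' starts a new block; remember its number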
            if line.endswith(b':\n'):
                prefix = line[:-2]
                continue
            outfile.write(b'%s,%s' % (prefix, line))

def main():
    in_files = sys.argv[1:-1]
    out_file = sys.argv[-1]
    with open(out_file, 'wb') as out:
        for fn in in_files:
            process(fn, out)

if __name__ == '__main__':
    main()
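
For example (assuming the script is saved as reformat.py; the name is my own choice), the input files are passed first and the output file last:

python reformat.py combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt output.csv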

On my machine and Python 3.7, this version performs at approximately 22 s/GiB, tested on four randomly generated files of 550 MiB each. It has a negligible memory footprint because it never loads the whole file into memory.

The script runs unchanged on Python 2.7, where it clocks in at 27 s/GiB. PyPy (6.0.0) runs it much faster, taking only 11 s/GiB.

Using concurrent.futures in theory ought to allow processing in one thread while another is waiting for IO, but the result ends up being significantly slower than the simplest sequential approach.
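
For illustration, here is a minimal sketch of that concurrent.futures idea (my own sketch, not the exact code that was benchmarked): each worker thread turns one input file into a bytes blob, and the main thread writes the blobs out in order. Note that, unlike the sequential version, it buffers each file's output in memory.

from concurrent.futures import ThreadPoolExecutor

def process_to_bytes(in_filename):
    # same line-by-line rewrite as above, but collected in memory
    chunks = []
    prefix = b''
    with open(in_filename, 'rb') as r:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]
                continue
            chunks.append(b'%s,%s' % (prefix, line))
    return b''.join(chunks)

def main_threaded(in_files, out_file):
    # map() yields results in the order of in_files, so the output order
    # matches the sequential version
    with ThreadPoolExecutor(max_workers=len(in_files)) as pool, \
            open(out_file, 'wb') as out:
        for blob in pool.map(process_to_bytes, in_files):
            out.write(blob)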

You want to read 2 GiB and write 2 GiB with low elapsed time and low memory consumption. Parallelism, for core and for spindle, matters. Ideally you would tend to keep all of them busy. I assume you have at least four cores available. Chunking your I/O matters, to avoid excessive malloc'ing.

Start with the simplest possible thing. Please make some measurements and update your question to include them.

sequential

Please make sequential timing measurements of

$ cat combined_data_[1234].csv > /dev/null

and

$ cat combined_data_[1234].csv > big.csv

I assume you will have low CPU utilization, and thus will be measuring read & write I/O rates.

parallel

Please make parallel I/O measurements:

cat combined_data_1.csv > /dev/null &
cat combined_data_2.csv > /dev/null &
cat combined_data_3.csv > /dev/null &
cat combined_data_4.csv > /dev/null &
wait

This will let you know if overlapping reads offers a possibility for speedup. For example, putting the files on four different physical filesystems might allow this -- you'd be keeping four spindles busy.

async

Based on these timings, you may choose to ditch async I/O, and instead fork off four separate python interpreters.
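
A minimal sketch of that idea, using multiprocessing rather than literally forking interpreters by hand (the .part naming scheme is my own assumption): each worker process rewrites one input file into a part file, and the parent concatenates the parts in order.

import shutil
from multiprocessing import Pool

def process_one(in_filename):
    # rewrite one input file into '<input>.part' with the id prefixed
    part_name = in_filename + '.part'
    prefix = b''
    with open(in_filename, 'rb') as r, open(part_name, 'wb') as w:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]
                continue
            w.write(b'%s,%s' % (prefix, line))
    return part_name

if __name__ == '__main__':
    in_files = ['combined_data_%d.txt' % i for i in range(1, 5)]
    with Pool(len(in_files)) as pool:
        parts = pool.map(process_one, in_files)
    # concatenate the part files in the original order
    with open('output.csv', 'wb') as out:
        for part in parts:
            with open(part, 'rb') as p:
                shutil.copyfileobj(p, out)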

logic

        content = file.read().split(":")

This is where much of your large memory footprint comes from. Rather than slurping in the whole file at once, consider reading by lines, or in chunks. A generator might offer you a convenient API for that.
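
For example, a generator along these lines (a sketch of my own, not code from the question) yields one prefixed CSV line at a time instead of building a list of everything:

def prefixed_lines(filename):
    prefix = None
    with open(filename) as f:
        for line in f:
            line = line.rstrip('\n')
            if line.endswith(':'):
                prefix = line[:-1]      # e.g. "13368:" -> "13368"
            elif line:
                yield '%s,%s\n' % (prefix, line)

# usage: memory stays flat because lines are written as they are produced
# with open('output.csv', 'w') as out:
#     out.writelines(prefixed_lines('combined_data_1.txt'))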

EDIT:

compression

It appears that you are I/O bound -- you have idle cycles while waiting on the disk. If the final consumer of your output file is willing to do decompression, then consider using gzip, xz/lzma, or snappy. The idea is that most of the elapsed time is spent on I/O, so you want to manipulate smaller files to do less I/O. This benefits your script when writing 2 GiB of output, and may also benefit the code that consumes that output.

As a separate item, you might possibly arrange for the code that produces the four input files to produce compressed versions of them.
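
A minimal sketch of the output-compression idea (file names and compresslevel are my own choices): wrap the output in gzip so far fewer bytes hit the disk. If the inputs were also produced as .gz files, reading them back through gzip.open(..., 'rb') works the same way, line by line.

import gzip

def process(in_filename, outfile):
    # same line-by-line prefixing as before, writing into a gzip stream
    prefix = b''
    with open(in_filename, 'rb') as r:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]
                continue
            outfile.write(b'%s,%s' % (prefix, line))

if __name__ == '__main__':
    in_files = ['combined_data_%d.txt' % i for i in range(1, 5)]
    # compresslevel=1 favours speed over ratio, which suits an I/O-bound job
    with gzip.open('output.csv.gz', 'wb', compresslevel=1) as out:
        for fn in in_files:
            process(fn, out)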

I have tried to solve your problem. I think this is a very easy and simple way if you don't have any prior knowledge of any special library.

I just took 2 input files named input.txt & input2.txt with the following contents.

Note: All files are in same directory.

input.txt

13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25

input2.txt

13364:
2385001,5,2004-06-08
659435,1,2005-03-16
13370:
751811,2,2023-12-16
2625220,2,2015-05-26

I have written the code in a modular way so that you can easily import and use it in your project. Once you run the code below from a terminal using python3 csv_writer.py, it will read all the files provided in the list file_names and generate output.csv with the result that you're looking for.

csv_writer.py

# https://stackoverflow.com/questions/55226823/reduce-runtime-file-reading-string-manipulation-of-every-line-and-file-writing
import re

def read_file_and_get_output_lines(file_names):
    output_lines = []

    for file_name in file_names:
        with open(file_name) as f:
            lines = f.readlines()
            for new_line in lines:
                new_line = new_line.strip()

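                # block headers look like "13368:"; anything else is a data row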
                if not re.match(r'^\d+:$', new_line):
                    output_line = [old_line]
                    output_line.extend(new_line.split(","))
                    output_lines.append(output_line)
                else:
                    old_line = new_line.rstrip(":")

    return output_lines

def write_lines_to_csv(output_lines, file_name):
    with open(file_name, "w+") as f:
        for arr in output_lines:
            line = ",".join(arr)
            f.write(line + '\n')

if __name__ == "__main__":
    file_names = [
        "input.txt",
        "input2.txt"
    ]

    output_lines = read_file_and_get_output_lines(file_names)
    print(output_lines)
    # [['13368', '2385003', '4', '2004-07-08'], ['13368', '659432', '3', '2005-03-16'], ['13369', '751812', '2', '2002-12-16'], ['13369', '2625420', '2', '2004-05-25'], ['13364', '2385001', '5', '2004-06-08'], ['13364', '659435', '1', '2005-03-16'], ['13370', '751811', '2', '2023-12-16'], ['13370', '2625220', '2', '2015-05-26']]

    write_lines_to_csv(output_lines, "output.csv")

output.csv

13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25
13364,2385001,5,2004-06-08
13364,659435,1,2005-03-16
13370,751811,2,2023-12-16
13370,2625220,2,2015-05-26
