Improving the efficiency (memory/time) of the following Python code

I have to read about 300 files to build an association with the following piece of code. Given how the association is built, I have to read them all into memory.

    with util.open_input_file(f) as f_in:
        for l in f_in:
            w = l.split(',')
            # dk.to_key(...) is guaranteed to be unique for each line in the file
            dfm = dk.to_key((idx, i, int(w[0]), int(w[1])))
            cands = w[2].split(':')
            for cand in cands:
                tmp_data.setdefault(cand, []).append(dfm)

Then I need to write out the data structure above in this format:

k1,v1:v2:v3...
k2,v4:v5:v6...

I use the following code:

    # Sort / join values.
    cand2dfm_data = {}
    for k,v in tmp_data.items():
        cand2dfm_data[k] = ':'.join(map(str, sorted(v, key=int)))
    tmp_data = {}

    # Write cand2dfm CSV file.
    with util.open_output_file(cand2dfm_file) as f_out:
        for k in sorted(cand2dfm_data.keys()):
            f_out.write('%s,%s\n' % (k, cand2dfm_data[k]))

Since I have to process a significant number of files, I'm observing two problems:

  1. The memory used to store tmp_data is very large. In my use case, processing 300 files, it uses about 42 GB.

  2. Writing out the CSV file takes a long time. This is because I'm calling write() once per item (about 2.2M items). Furthermore, the output stream uses a gzip compressor to save disk space.

In my use case, the numbers are guaranteed to be 32-bit unsigned.

Questions:

  1. To reduce memory, I think it would be better to use 32-bit ints to store the data. Should I use ctypes.c_int() to store the values in the dict (right now they are strings), or is there a better way?

  2. To speed up the writing, should I write to a StringIO object and then dump that to the file, or is there a better way?

  3. Alternatively, is there a better way to accomplish the above logic without reading everything into memory?

A few thoughts.

  1. Currently you are duplicating the data multiple times in memory: you load it first into tmp_data, then copy everything into cand2dfm_data, and then create a list of keys by calling sorted(cand2dfm_data.keys()).

    To reduce memory usage:

    • Get rid of tmp_data; parse and write your data directly into cand2dfm_data

    • Make cand2dfm_data a list of tuples, not a dict

    • Use cand2dfm_data.sort(...) instead of sorted(cand2dfm_data) to avoid creating a new list (points 1 and 2 are combined in the sketch after this list)

  2. To speed up processing:

    • Convert the keys into ints to improve sorting performance (this will reduce memory usage as well)

    • Write data to disk in chunks, like 100, 500, or 1000 records in one go; this should improve I/O performance a bit

  3. Use a profiler to find other performance bottlenecks

  4. If, with the above optimizations, the memory footprint is still too large, then consider using disk-backed storage for storing and sorting the temporary data, e.g. SQLite (see the second sketch below)
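
Putting points 1 and 2 together, here is a minimal sketch, not a drop-in replacement: it reuses the util, dk, idx, i and cand2dfm_file names from the question, assumes input_files is the list of roughly 300 input files, and assumes (as point 2 suggests) that the cand keys are numeric strings, so both columns can be stored as plain ints.

    import itertools
    import operator

    pairs = []  # flat list of (cand, dfm) tuples instead of a dict of lists

    for f in input_files:  # idx and i assumed to be set per file, as in the question
        with util.open_input_file(f) as f_in:
            for l in f_in:
                w = l.split(',')
                dfm = dk.to_key((idx, i, int(w[0]), int(w[1])))
                for cand in w[2].split(':'):
                    pairs.append((int(cand), int(dfm)))

    pairs.sort()  # in-place; tuples sort by cand first, then by dfm

    # Write in chunks so the gzip stream sees fewer, larger write() calls.
    CHUNK = 1000
    with util.open_output_file(cand2dfm_file) as f_out:
        buf = []
        for cand, group in itertools.groupby(pairs, key=operator.itemgetter(0)):
            buf.append('%d,%s\n' % (cand, ':'.join(str(dfm) for _, dfm in group)))
            if len(buf) >= CHUNK:
                f_out.write(''.join(buf))
                buf = []
        if buf:
            f_out.write(''.join(buf))

A flat, sorted list of int pairs avoids the dict-of-lists overhead and the per-key sorted() calls, and itertools.groupby() recovers the per-key grouping from the sorted order.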

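For point 4, a minimal sketch of the SQLite route, with a hypothetical database file name and schema: stream the (cand, dfm) pairs into an on-disk table as they are parsed, then let the database return them in sorted order, so the whole data set never has to sit in RAM.

    import itertools
    import operator
    import sqlite3

    con = sqlite3.connect('tmp_pairs.db')  # hypothetical scratch file on disk
    con.execute('CREATE TABLE IF NOT EXISTS pairs (cand INTEGER, dfm INTEGER)')

    def add_pairs(rows):
        # rows: an iterable of (cand, dfm) int tuples parsed from one input file
        con.executemany('INSERT INTO pairs VALUES (?, ?)', rows)
        con.commit()

    def write_sorted(f_out, chunk=1000):
        # ORDER BY makes SQLite do the sorting (spilling to temporary files
        # if needed) instead of Python holding everything in memory
        cur = con.execute('SELECT cand, dfm FROM pairs ORDER BY cand, dfm')
        buf = []
        for cand, group in itertools.groupby(cur, key=operator.itemgetter(0)):
            buf.append('%d,%s\n' % (cand, ':'.join(str(dfm) for _, dfm in group)))
            if len(buf) >= chunk:
                f_out.write(''.join(buf))
                buf = []
        if buf:
            f_out.write(''.join(buf))

add_pairs() would be called from the per-file parsing loop, and write_sorted() once at the end with the stream from util.open_output_file(cand2dfm_file).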

 