
Group lines of a text file into files per category - Most efficient way

I have a huge text file (large enough to fill my computer's memory) that I need to split into smaller files.

The file contains CSV lines, where the first field of each line is an ID:

ID1, val11, val12, val13
ID2, val21, val22, val23
ID1, val31, val32, val33
ID3, val41, val42, val43

I want to read each line, or group of lines, from the source file and create smaller files that group the lines per ID:

File1:
    val11, val12, val13
    val31, val32, val33
File2:
    val21, val22, val23
File3:
    val41, val42, val43

So far I can do it with the following code, but it is taking really long (I don't have 10 days to spend on this).

def groupIDs(fileName,folder):
    loadedFile = open(fileName, 'r')
    firstLine = loadedFile.readline() #skip titles

    folder += "/"

    count = 0

    for line in loadedFile:

        elems = line.split(',')
        id = elems[0]

        rest = ""
        for elem in elems[1:]:
            rest+=elem + ","

        with open(folder+id,'a') as f:
            f.write(rest[:-1])

        #printing progress
        count+=1
        if count % 50000 == 0:
            print(count)

    loadedFile.close()

The bottleneck seems to be disk performance, as shown by the resource monitor (CPU usage is below 20% and memory is barely touched).

How can I improve this for best performance?

You could keep the data in RAM and only flush it to disk every few thousand lines, or whenever RAM usage reaches a threshold of your choosing.

You should also use context managers for the files, and use the os.path or pathlib modules from the standard library instead of manually building paths from strings.
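
For example, a minimal sketch of that path handling with pathlib (the function name append_line and the arguments are just placeholders for illustration):

from pathlib import Path

def append_line(outputdir, category, line):
    # Path objects handle separators for you, no manual "/" concatenation
    outfile = Path(outputdir) / f"{category}.csv"
    # the context manager guarantees the file is closed even if an error occurs
    with outfile.open('a') as f:
        f.write(line)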

Here is a solution that flushes to disk every 10000 lines; adjust it to fit your problem:

import os
from glob import iglob
from collections import defaultdict


def split_files_into_categories(inputfiles, outputdir):
    count = 0
    categories = defaultdict(bytearray)

    for inputfile in inputfiles:
        with open(inputfile, 'rb') as f:
            next(f) # skip first line

            for line in f:

                if count % 10000 == 0:
                    save_results(categories, outputdir)
                    categories.clear()

                category, _, rest = line.partition(b',')

                categories[category] += rest
                count += 1

    save_results(categories, outputdir)


def save_results(categories, outputdir):
    for category, data in categories.items():
        with open(os.path.join(outputdir, category.decode() + '.csv'), 'ab') as f:
            f.write(data)


if __name__ == '__main__':
    # run on all csvs in the data folder
    split_files_into_categories(iglob('data/*.csv'), 'by_category')

A few explanations:

  • I open the files in binary mode and use a bytearray; this avoids copying the data. In Python, strings are immutable, so += creates a new string and reassigns it (see the small illustration after this list).

  • defaultdict(bytearray) will create an empty bytearray for every new category, as soon as it is accessed for the first time.

  • You could replace the if count % 10000 == 0 check with a check on memory consumption, like this:

     import os
     import psutil

     process = psutil.Process(os.getpid())

    and then check

     # save results if the process uses more than 1 GB of RAM
     if process.memory_info().rss > 1e9:
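
Putting that together, here is a minimal sketch of how the memory check could take the place of the line counter (psutil is a third-party package; the function name split_with_memory_limit and the 1 GB default are only illustrative, and it reuses the save_results function defined above):

import os
from collections import defaultdict

import psutil

def split_with_memory_limit(inputfiles, outputdir, max_rss=1e9):
    process = psutil.Process(os.getpid())
    categories = defaultdict(bytearray)

    for inputfile in inputfiles:
        with open(inputfile, 'rb') as f:
            next(f)  # skip the header line

            for line in f:
                category, _, rest = line.partition(b',')
                categories[category] += rest

                # flush to disk once this process uses more than max_rss bytes
                if process.memory_info().rss > max_rss:
                    save_results(categories, outputdir)
                    categories.clear()

    # write whatever is still buffered at the end
    save_results(categories, outputdir)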
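
And, regarding the first bullet, a tiny illustration of why a bytearray is used instead of building up a str:

text = ""
text += "val11, val12\n"    # str is immutable: allocates a new string and copies the old contents

buf = bytearray()
buf += b"val11, val12\n"    # bytearray is mutable: appends in place, no copy of the existing data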
