
Improving a Python script that reads files

I wrote a Python script to process text files. The input is a file with several lines; at the beginning of each line there is a number (1, 2, 3, ..., n). Then comes an empty line, and a last line on which some text is written.

I need to read through this file and delete some lines at the beginning and some at the end (say numbers 1 to 5, then 78 to the end). I want to write the remaining lines to a new file (in a new directory) and renumber the numbers at the start of those lines (in my example, 6 would become 1, 7 would become 2, and so on).
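To illustrate (with made-up sample lines), the transformation I want looks like this:

```python
lines = ['1,foo\n', '2,bar\n', '3,baz\n', '4,qux\n']

# Keep lines 2 and 3, then renumber them starting from 1.
kept = lines[1:3]
renumbered = ['%d,%s' % (i, line.split(',', 1)[1])
              for i, line in enumerate(kept, start=1)]
print(renumbered)  # ['1,bar\n', '2,baz\n']
```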

I wrote the following:

def treatFiles(oldFile, newFile, firstF, startF, lastF):
    # firstF is simply an index
    # startF corresponds to the first line I want to keep
    # lastF corresponds to the last line I want to keep
    numberFToDeleteBeginning = int(startF) - int(firstF)
    with open(oldFile) as old, open(newFile, 'w') as new:
        countLine = 0
        for line in old:
            countLine += 1
            if countLine <= numberFToDeleteBeginning:
                pass
            elif countLine > int(lastF) - int(firstF):
                pass
            elif line.split(',')[0] == '\n':
                new.write(line)
            else:
                newLineList = [str(countLine - numberFToDeleteBeginning)] + line.split(',')
                del newLineList[1]
                newLine = str(newLineList[0])
                for k in range(1, len(newLineList)):
                    newLine = newLine + ',' + str(newLineList[k])
                new.write(newLine)


if __name__ == '__main__':
    from sys import argv
    import os

    os.makedirs('treatedFiles')
    newFile = 'treatedFiles/' + argv[1]
    treatFiles(argv[1], newFile, argv[2], argv[3], argv[4])

My code works correctly but is far too slow (I have files of about 10 GB to process, and it has been running for hours).

Does anyone know how I can improve it?

I would get rid of the for loop in the middle and the expensive repeated .split() calls:

from itertools import islice

def treatFiles(old_file, new_file, index, start, end):
    with open(old_file, 'r') as old, open(new_file, 'w') as new:
        # Skip directly to the slice of lines we want to keep.
        sliced_file = islice(old, start - index, end - index)

        for line_number, line in enumerate(sliced_file, start=1):
            # partition never raises, unlike unpacking split(',', 1),
            # which would fail on lines that contain no comma.
            number, sep, rest = line.partition(',')

            if not sep or number == '\n':
                new.write(line)
            else:
                new.write(str(line_number) + ',' + rest)

Also, convert your three numerical arguments to integers before passing them into the function:

treatFiles(argv[1], newFile, int(argv[2]), int(argv[3]), int(argv[4]))
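To see why islice helps: it lazily skips and takes lines from the file iterator, so no per-line work is done on the discarded lines and the file is never loaded into memory. A tiny demo with made-up lines:

```python
from itertools import islice

# Hypothetical sample lines: number, comma, payload.
lines = ['1,a\n', '2,b\n', '3,c\n', '4,d\n', '5,e\n']

# islice skips the first line and stops after the fourth,
# yielding items at positions 1, 2 and 3 of the iterator.
kept = list(islice(iter(lines), 1, 4))
print(kept)  # ['2,b\n', '3,c\n', '4,d\n']
```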
