
Looping through big files takes hours in Python

This is my second day working in Python. I worked on this in C++ for a while, but decided to try Python. My program works as expected. However, when I process one file at a time without the glob loop, it takes about half an hour per file. When I include the glob, the loop takes about 12 hours to process 8 files.

My question is this: is there anything in my program that is definitely slowing it down? Is there anything I should be doing to make it faster?

I have a folder of large files, for example:

file1.txt (6 GB), file2.txt (5.5 GB), file3.txt (6 GB)

If it helps, each line of data begins with a character that tells me how the rest of the line is formatted, which is why I have all of the if/elif statements. A line of data looks like this: T35201 M352 RZNGA AC

I am trying to read each file, do some parsing using splits, and then save the file.

The computer has 32 GB of RAM, so my method is to read each file into RAM, loop through it, and then save the output, clearing RAM for the next file.

I've included the code so you can see the methods I am using. I use an if/elif chain with about 10 different elif branches. I have tried a dictionary, but I couldn't figure that out to save my life.

Any answers would be helpful.

import csv
import glob

for filename in glob.glob("/media/3tb/5may/*.txt"):
    f = open(filename,'r')
    c = csv.writer(open(filename + '.csv','wb'))

    second = 0
    mill = 0
    for line in f.readlines():
        # print line
        event = 0
        ticker = 0
        marketCategory = 0
        variable = line[0:1]

        if variable == 'T':
            second = line[1:6]
            mill = 0
        else:
            second = second

        if variable == 'R':
            ticker = line[1:7]
            marketCategory = line[7:8]
        elif variable == ...
        elif variable == ...
        elif ...
        elif ...
        elif ...
        elif ...
        elif ...

        if variable != 'T' and variable != 'M':
            c.writerow([second, mill, event, ...])
    f.close()

UPDATE: Each of the elif statements is nearly identical. The only parts that change are the ways that I split the lines. Here are two of the elif statements (there are 13 total, and they are almost all identical except for the way the line is split):

elif variable == 'C':
    order = line[1:10]
    shares = line[10:16]
    match = line[16:25]
    printable = line[25:26]
    price = line[26:36]
elif variable == 'P':
    ticker = line[17:23]
    order = line[1:10]
    buy = line[10:11]
    shares = line[11:17]
    price = line[23:33]
    match = line[33:42]

UPDATE 2: I have run the code using for line in f two separate times. The first time, I ran a single file without for filename in glob.glob("/media/3tb/file.txt"):, hard-coding the file path for one file, and it took about 30 minutes.

I ran it again with for filename in glob.glob("/media/3tb/*file.txt") and it took an hour just for one file in the folder. Does the glob code add that much time?

Here:

for line in f.readlines():

You should just do this:

for line in f:

The former reads the entire file into a list of lines, then iterates over that list. The latter does it incrementally, which should drastically reduce the total memory allocated and later freed by your program.
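For example, here is a minimal side-by-side sketch of the two patterns (the file name big.txt is just a placeholder):

import csv

# Pattern 1: readlines() builds a list of every line first --
# for a 6 GB file that is a huge allocation before any work starts.
with open('big.txt', 'r') as f:
    for line in f.readlines():
        pass  # parse line here

# Pattern 2: iterating the file object streams one line at a time,
# so memory use stays roughly constant regardless of file size.
with open('big.txt', 'r') as f:
    for line in f:
        pass  # parse line here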

Whenever you ask "what part of this is slowing down the whole thing?" the answer is "profile it." There's an excellent description of how to do this in Python's documentation at The Python Profilers. Also, as John Zwinck points out, you're loading too much into memory at once and should only be loading one line at a time (file objects are "iterable" in Python).
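For instance, you can run the whole script under the profiler with python -m cProfile -s cumulative yourscript.py, or profile a single entry point from code. A minimal, self-contained sketch (process_all_files here is a dummy stand-in for your own top-level function):

import cProfile
import pstats

def process_all_files():
    # Stand-in for the real top-level work; replace with your own entry point.
    return sum(i * i for i in range(10**6))

# Run the function under the profiler and dump the stats to a file.
cProfile.run('process_all_files()', 'profile.out')

# Show the 20 calls with the highest cumulative time.
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)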

Personally, I prefer what Perl calls a "dispatch table" to a huge if...elif...elif monstrosity. This webpage describes a Pythonic way of doing it. It's a dictionary mapping keys to functions, which doesn't work in all cases, but for simple if x==2:...elif x==3... (i.e., switching on the value of one variable) it works great.
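Here is a minimal sketch of that idea using a few of the record layouts already shown in the question (the remaining record types would be added to the table the same way):

# Each handler parses one record type and returns its fields as a dict.
def parse_T(line):
    return {'second': line[1:6]}

def parse_R(line):
    return {'ticker': line[1:7], 'marketCategory': line[7:8]}

def parse_C(line):
    return {'order': line[1:10], 'shares': line[10:16], 'match': line[16:25],
            'printable': line[25:26], 'price': line[26:36]}

# Dispatch table: first character of the line -> parsing function.
handlers = {'T': parse_T, 'R': parse_R, 'C': parse_C}

def parse_line(line):
    handler = handlers.get(line[0:1])
    if handler is None:
        return None  # unknown record type
    return handler(line)

The dictionary lookup replaces the long if/elif chain, and adding a new record type is just one more entry in the table.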

Use a generator (via yield) to 'buffer' more lines in memory than just one line at a time, but NOT the whole file at a time.

def readManyLines(fObj, num=1000):
    # num is a size hint in bytes: each readlines() call returns roughly
    # that many bytes' worth of complete lines, not num lines.
    while True:
        lines = fObj.readlines(num)
        if not lines:
            break
        for line in lines:
            yield line

f = open(filename, 'r')
for line in readManyLines(f):
    process(line)

Not sure if this helps at all, but try using this instead of glob.glob, just to rule that out as the problem. I'm on Windows, so I can't be 100% certain this works under Unix, but I don't see why it wouldn't.

import re
import os
import csv

def find_text_files(root):
    """Find .txt files under a given directory."""
    txt_pattern = re.compile(r'txt$', re.I)
    foundFiles = []
    for dirpath, dirnames, filenames in os.walk(root):
        for file in filenames:
            if txt_pattern.search(file):
                foundFiles.append(os.path.join(dirpath, file))
    return foundFiles

txtfiles = find_text_files(r'd:\files')  # replace the path with yours

for filename in txtfiles:
    f = open(filename,'r')
    c = csv.writer(open(filename + '.csv','wb'))
