
How to speed up file parsing in Python?

Below is a section from an app I have been working on. It is used to update a text file with addValue. At first I thought it was working, but it seems to add extra lines, and it is also very, very slow.

trakt_shows_seen is a dictionary of shows, 1 show section looks like

{'episodes': [{'season': 1, 'playcount': 0, 'episode': 1}, {'season': 1, 'playcount': 0, 'episode': 2}, {'season': 1, 'playcount': 0, 'episode': 3}], 'title': 'The Ice Cream Girls'}

The section should search the file for each title, season and episode, and when a match is found, check whether the line has a watched marker (checkValue): if it does, change it to addValue; if it does not, append addValue to the end of the line.

A line from the file

_F  /share/Storage/NAS/Videos/Tv/The Ice Cream Girls/Season 01/The Ice Cream Girls - S01E01 - Episode 1.mkv _ai Episode 1   _e  1   _r  6.5 _Y  71  _s  1   _DT 714d861 _et Episode 1   _A  4379,4376,4382,4383 _id 2551    _FT 714d861 _v  c0=h264,f0=25,h0=576,w0=768 _C  T   _IT 717ac9d _R  GB: _m  1250    _ad 2013-04-19  _T  The Ice Cream Girls _G  d   _U   thetvdb:268910 imdb:tt2372806  _V  HDTV
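Each line is a run of tab-separated marker/value pairs (`_T` title, `_s` season, `_e` episode, `_w` watched, and so on). As a minimal sketch of how such a line can be pulled apart, assuming values never themselves begin with an underscore (an assumption inferred from the sample line, not documented anywhere):

```python
def parse_line(line):
    """Split a tab-delimited line into a {marker: value} dict.

    Tokens starting with "_" are treated as markers; everything between
    one marker and the next is joined back together as that marker's value.
    """
    fields = [f for f in line.rstrip("\n").split("\t") if f]
    record = {}
    key = None
    for field in fields:
        if field.startswith("_"):
            key = field
            record[key] = ""
        elif key is not None:
            # re-join values that contain tab-separated tokens
            record[key] = (record[key] + " " + field).strip() if record[key] else field
    return record

# hypothetical shortened line in the same marker/value layout as the sample
sample = "_T\tThe Ice Cream Girls\t_s\t1\t_e\t1\t_w\t0\t"
print(parse_line(sample))
```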

So my question: is there a better, faster way? Can I load the file into memory (it is around 1 MB), change the required lines and then save it? Or can anyone suggest another method that will speed things up?

Thanks for taking the time to look.

EDIT I have changed the code quite a lot and this version does run a lot faster, but the output is not as expected; for some reason it writes lines_of_interest offsets into the file, even though there is no code that should do this?

I also have not yet added any encoding options, but as the file is UTF-8 I suspect there will be an issue with accented titles.

    if trakt_shows_seen:
        addValue = "\t_w\t1\t"
        replacevalue = "\t_w\t0\t"
        with open(OversightFile, 'rb') as infile:
            p = '\t_C\tT\t'
            for line in infile:
                if p in line:
                    tv_offset = infile.tell() - len(line) - 1#Find first TV in file, search from here
                    break

            lines_of_interest = set()
            for show_dict in trakt_shows_seen:
                for episode in show_dict['episodes']:
                    p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show_dict["title"]+')\t.*\t_e\t('+str(episode["episode"])+')\t')
                    infile.seek(tv_offset)#search from first Tv show
                    for line in infile:
                        if p.findall(line):
                            search_offset = infile.tell() - len(line) - 1
                            lines_of_interest.add(search_offset)#all lines that need to be changed
        with open(OversightFile, 'rb+') as outfile:
            for lines in lines_of_interest:
                for change_this in outfile:
                    outfile.seek(lines)
                    if replacevalue in change_this:
                        change_this = change_this.replace(replacevalue, addValue)
                        outfile.write(change_this)
                        break#Only check 1 line
                    elif not addValue in change_this:
                        #change_this.extend(('_w', '1'))
                        change_this = change_this.replace("\t\n", addValue+"\n")
                        outfile.write(change_this)
                        break#Only check 1 line
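Regarding the encoding concern above: one option (a sketch only, not tested against the real Oversight file) is to read and write through `io.open` with an explicit `encoding='utf-8'`, so that accented titles are handled as Unicode text rather than raw bytes:

```python
import io

def rewrite_utf8(path, transform):
    """Read a UTF-8 file fully into memory, apply `transform` to each
    line, and write the result back. `transform` takes and returns a
    unicode string. `rewrite_utf8` and `transform` are illustrative
    names, not part of any existing API."""
    with io.open(path, "r", encoding="utf-8") as infile:
        lines = infile.readlines()
    with io.open(path, "w", encoding="utf-8") as outfile:
        outfile.writelines(transform(line) for line in lines)
```

Because the whole file is only about 1 MB, reading it all at once is cheap, and writing once at the end avoids the in-place seek-and-write problems in the code above.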

Aham -- you are opening, reading and rewriting your file in every repetition of your for loop - once for each episode of each show. Few things in the whole Multiverse could be slower than that.

You can keep the same approach - just read your whole file once, before the for loops, iterate over the list you read, and write everything back to disk just once.

more or less:

if trakt_shows_seen:
    addValue = "\t_w\t1\t"
    checkvalue = "\t_w\t0\t"
    print '  %s TV shows episodes playcount will be updated on Oversight' % len(trakt_shows_seen)
    myfile_list = open(file).readlines()
    for show in trakt_shows_seen:
        print '    --> ' + show['title'].encode('utf-8')
        for episode in show['episodes']:
            print '     Season %i - Episode %i' % (episode['season'], episode['episode'])
            p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show["title"]+')\t.*\t_e\t('+str(episode["episode"])+')\t')
            newList = []

            for line in myfile_list:
                if p.findall(line) :
                    if checkvalue in line:
                        line = line.replace(checkvalue, addValue)
                    elif not addValue in line:
                        line = line.strip("\t\n") + addValue+"\n"
                newList.append(line)
            myfile_list = newList

    outref = open(file,'w')
    outref.writelines(newList)
    outref.close()

This is still far from optimal - but it is the least amount of change to your code that stops what was slowing it down so much.

You're rereading and rewriting your entire file for every episode of every show you track - of course this is slow. Don't do that. Instead, read the file once. Parse out the show title and season and episode numbers from each line (probably using the built-in csv module with delimiter='\t'), and see if they're in the set you're tracking. Make your substitution if they are, and write the line either way.

It's going to look something like this:

title_index = # whatever column number has the show title
season_index = # whatever column number has the season number
episode_index = # whatever column number has the episode number

with open('somefile', 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    modified_lines = []
    for line in reader:
        showtitle = line[title_index]
        if showtitle in trakt_shows_seen:
            season_number = int(line[season_index])
            episode_number = int(line[episode_index])
            if any(x for x in trakt_shows_seen[showtitle] if x['season'] == season_number and x['episode'] == episode_number):
                # line matches a tracked episode
                # list.index raises ValueError when the item is missing
                # (it never returns -1), so test membership first
                if '_w' in line:
                    watch_count_index = line.index('_w')
                    # possible check value found - you may be able to skip straight to assigning the next element to '1'
                    if line[watch_count_index + 1] == '0':
                        # check value found, replace
                        line[watch_count_index + 1] = '1'
                    elif line[watch_count_index + 1] != '1':
                        # not sure what you want to do if something like \t_w\t2\t is present
                        line[watch_count_index + 1] = '1'
                else:
                    line.extend(('_w', '1'))
        modified_lines.append(line)
with open('somefile', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerows(modified_lines)

The exact details will depend on how strict your file format is - the more you know about the structure of the line beforehand the better. If the indices of the title, season and episode fields vary, probably the best thing to do is iterate once through the list representing the line looking for the relevant markers.

I have skipped over error checking - depending on your confidence in the original file you might want to ensure that season and episode numbers can be converted to ints, or stringify your trakt_shows_seen values. The csv reader will return encoded bytestrings, so if show names in trakt_shows_seen are Unicode objects (which they don't appear to be in your pasted code) you should either decode the csv reader's results or encode the dictionary values.

I personally would probably convert trakt_shows_seen to a set of (title, season, episode) tuples, for more convenient checking to see if a line is of interest. At least if the field numbers for title, season and episode are fixed. I would also write to my outfile file (under a different filename) as I read the input file rather than keeping a list of lines in memory; that would allow some sanity checking with, say, a shell's diff utility before overwriting the original input.
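That write-as-you-read idea might look like the sketch below. The field positions (`title_index` and friends) are illustrative assumptions for a plain-columns layout, not the real marker-based layout of the Oversight file, and the function name is made up; it is Python 3 style (text-mode files with `newline=''`), whereas the code above is Python 2:

```python
import csv

def update_file(in_path, out_path, shows_of_interest,
                title_index=0, season_index=1, episode_index=2):
    """Copy in_path to out_path line by line, flipping the _w marker to 1
    on lines whose (title, season, episode) appear in shows_of_interest.

    Assumes title/season/episode sit at fixed column positions; with the
    real marker-based layout you would locate them by scanning for the
    _T/_s/_e markers instead.
    """
    with open(in_path, "r", newline="") as infile, \
         open(out_path, "w", newline="") as outfile:
        reader = csv.reader(infile, delimiter="\t")
        writer = csv.writer(outfile, delimiter="\t")
        for row in reader:
            key = (row[title_index], int(row[season_index]), int(row[episode_index]))
            if key in shows_of_interest:
                if "_w" in row:
                    row[row.index("_w") + 1] = "1"   # flip existing marker
                else:
                    row.extend(["_w", "1"])          # append missing marker
            writer.writerow(row)
```

Writing to `out_path` rather than back over `in_path` is what allows the diff-before-overwrite sanity check mentioned above.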

To create a set from your existing dictionary - to some extent it depends on exactly what format trakt_shows_seen uses. Your example shows an entry for one show, but doesn't indicate how it represents more than one show. For now I'm going to assume it's a list of such dictionaries, based on your attempted code.

shows_of_interest = set()
for show_dict in trakt_shows_seen:
    title = show_dict['title']
    for episode_dict in show_dict['episodes']:
        shows_of_interest.add((title, episode_dict['season'], episode_dict['episode']))

Then in the loop that reads the file:

        # the rest as shown above              
        season_number = int(line[season_index])
        episode_number = int(line[episode_index])
        if (showtitle, season_number, episode_number) in shows_of_interest:
            # line matches a tracked episode
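Putting the two pieces together, the lookup reduces to one set-membership test per line. As a tiny self-contained check, using the example dictionary from the question:

```python
trakt_shows_seen = [
    {"title": "The Ice Cream Girls",
     "episodes": [{"season": 1, "playcount": 0, "episode": 1},
                  {"season": 1, "playcount": 0, "episode": 2},
                  {"season": 1, "playcount": 0, "episode": 3}]},
]

# build the set of (title, season, episode) tuples as described above
shows_of_interest = set()
for show_dict in trakt_shows_seen:
    title = show_dict["title"]
    for episode_dict in show_dict["episodes"]:
        shows_of_interest.add((title, episode_dict["season"], episode_dict["episode"]))

print(("The Ice Cream Girls", 1, 2) in shows_of_interest)  # True
print(("The Ice Cream Girls", 2, 1) in shows_of_interest)  # False
```

Each membership test is an O(1) hash lookup, so the per-line cost no longer grows with the number of tracked episodes.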
