
How to speed up file parsing in Python?

Below is a section from an app I have been working on. The section is used to update a text file with addValue. At first I thought it was working, but it seems to add extra lines and it is also very, very slow.

trakt_shows_seen is a dictionary of shows; one show entry looks like:

{'episodes': [{'season': 1, 'playcount': 0, 'episode': 1}, {'season': 1, 'playcount': 0, 'episode': 2}, {'season': 1, 'playcount': 0, 'episode': 3}], 'title': 'The Ice Cream Girls'}

The section should search the file for each title, season and episode, and when a match is found check whether it has a watched marker (checkValue). If it does, it changes it to addValue; if it does not, it should add addValue to the end of the line.

A line from the file:

_F  /share/Storage/NAS/Videos/Tv/The Ice Cream Girls/Season 01/The Ice Cream Girls - S01E01 - Episode 1.mkv _ai Episode 1   _e  1   _r  6.5 _Y  71  _s  1   _DT 714d861 _et Episode 1   _A  4379,4376,4382,4383 _id 2551    _FT 714d861 _v  c0=h264,f0=25,h0=576,w0=768 _C  T   _IT 717ac9d _R  GB: _m  1250    _ad 2013-04-19  _T  The Ice Cream Girls _G  d   _U   thetvdb:268910 imdb:tt2372806  _V  HDTV

So my question: is there a better, faster way? Can I load the file into memory (the file is around 1 MB), change the required lines and then save the file, or can anyone suggest another method that will speed things up?

Thanks for taking the time to look.

EDIT: I have changed the code quite a lot and it does work a lot faster, but the output is not as expected - for some reason it writes lines_of_interest to the file even though there is no code to do this?

I also have not yet added any encoding options, but as the file is in UTF-8 I suspect there will be an issue with accented titles.

    if trakt_shows_seen:
        addValue = "\t_w\t1\t"
        replacevalue = "\t_w\t0\t"
        with open(OversightFile, 'rb') as infile:
            p = '\t_C\tT\t'
            for line in infile:
                if p in line:
                    tv_offset = infile.tell() - len(line) - 1#Find first TV in file, search from here
                    break

            lines_of_interest = set()
            for show_dict in trakt_shows_seen:
                for episode in show_dict['episodes']:
                    p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show_dict["title"]+')\t.*\t_e\t('+str(episode["episode"])+')\t')
                    infile.seek(tv_offset)#search from first Tv show
                    for line in infile:
                        if p.findall(line):
                            search_offset = infile.tell() - len(line) - 1
                            lines_of_interest.add(search_offset)#all lines that need to be changed
        with open(OversightFile, 'rb+') as outfile:
            for lines in lines_of_interest:
                for change_this in outfile:
                    outfile.seek(lines)
                    if replacevalue in change_this:
                        change_this = change_this.replace(replacevalue, addValue)
                        outfile.write(change_this)
                        break#Only check 1 line
                    elif not addValue in change_this:
                        #change_this.extend(('_w', '1'))
                        change_this = change_this.replace("\t\n", addValue+"\n")
                        outfile.write(change_this)
                        break#Only check 1 line

Aham -- you are opening, reading and rewriting your file in every repetition of your for loop - once for each episode of each show. Few things in the whole Multiverse could be slower than that.

You can go along the same lines - just read your whole file once, before the for loops, iterate over the list you read, and write everything back to disk, just once -

more or less:

if trakt_shows_seen:
    addValue = "\t_w\t1\t"
    checkvalue = "\t_w\t0\t"
    print '  %s TV shows episodes playcount will be updated on Oversight' % len(trakt_shows_seen)
    myfile_list = open(file).readlines()
    for show in trakt_shows_seen:
        print '    --> ' + show['title'].encode('utf-8')
        for episode in show['episodes']:
            print '     Season %i - Episode %i' % (episode['season'], episode['episode'])
            p = re.compile(r'\t_s\t('+str(episode["season"])+')\t.*\t_T\t('+show["title"]+')\t.*\t_e\t('+str(episode["episode"])+')\t')
            newList = []

            for line in myfile_list:
                if p.findall(line) :
                    if checkvalue in line:
                        line = line.replace(checkvalue, addValue)
                    elif not addValue in line:
                        line = line.strip("\t\n") + addValue+"\n"
                newList.append(line)
            myfile_list = newList

    outref = open(file,'w')
    outref.writelines(newList)
    outref.close()

This is still far from optimal - but it is the smallest change to your code that stops what is slowing it down so much.

You're rereading and rewriting your entire file for every episode of every show you track - of course this is slow. Don't do that. Instead, read the file once. Parse out the show title, season and episode numbers from each line (probably using the built-in csv library with delimiter='\t'), and see if they're in the set you're tracking. Make your substitution if they are, and write the line out either way.

It's going to look something like this:

import csv

title_index = # whatever column number has the show title
season_index = # whatever column number has the season number
episode_index = # whatever column number has the episode number

with open('somefile', 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    modified_lines = []
    for line in reader:
        showtitle = line[title_index]
        if showtitle in trakt_shows_seen:
            season_number = int(line[season_index])
            episode_number = int(line[episode_index])
            if any((x for x in trakt_shows_seen[showtitle] if x['season'] == season_number and x['episode'] == episode_number)):
                # line matches a tracked episode
                if '_w' in line:
                    # possible check value found - you may be able to skip straight to assigning the next element to '1'
                    # (note: list.index raises ValueError rather than returning -1, so test membership first)
                    watch_count_index = line.index('_w')
                    if line[watch_count_index + 1] == '0':
                        # check value found, replace
                        line[watch_count_index + 1] = '1'
                    elif line[watch_count_index + 1] != '1':
                        # not sure what you want to do if something like \t_w\t2\t is present
                        line[watch_count_index + 1] = '1'
                else:
                    line.extend(('_w', '1'))
        modified_lines.append(line)
with open('somefile', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerows(modified_lines)

The exact details will depend on how strict your file format is - the more you know about the structure of each line beforehand, the better. If the indices of the title, season and episode fields vary, probably the best thing to do is iterate once through the list representing the line, looking for the relevant markers.
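
For example, a minimal sketch of that marker scan might look like the helper below (assuming each value immediately follows its marker field, as in your sample line; extract_fields is just a hypothetical name):

def extract_fields(line):
    # 'line' is the list of fields csv.reader produced for one row
    title = season = episode = None
    for i, field in enumerate(line[:-1]):
        if field == '_T':
            title = line[i + 1]
        elif field == '_s':
            season = int(line[i + 1])
        elif field == '_e':
            episode = int(line[i + 1])
    return title, season, episode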

I have skipped over error checking - depending on your confidence in the original file you might want to ensure that the season and episode numbers can be converted to ints, or stringify your trakt_shows_seen values. The csv reader will return encoded bytestrings, so if the show names in trakt_shows_seen are Unicode objects (which they don't appear to be in your pasted code) you should either decode the csv reader's results or encode the dictionary values.
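
If they are Unicode, the encoding side could be as simple as the sketch below (assuming Python 2, a UTF-8 file, and trakt_shows_seen being a list of show dictionaries as in your example):

# encode the titles once, up front, so they compare equal to the
# bytestrings the csv reader yields
encoded_titles = set(show['title'].encode('utf-8') for show in trakt_shows_seen)

# ...or go the other way and decode each field as it is read:
# showtitle = line[title_index].decode('utf-8')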

I personally would probably convert trakt_shows_seen to a set of (title, season, episode) tuples, for more convenient checking of whether a line is of interest - at least if the field numbers for title, season and episode are fixed. I would also write to my output file (under a different filename) as I read the input file, rather than keeping a list of lines in memory; that would allow some sanity checking with, say, a shell's diff utility before overwriting the original input.
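
A minimal sketch of that write-as-you-go approach (the file names here are hypothetical, and the per-line logic is the same as shown earlier):

import csv

with open('oversight.txt', 'rb') as infile, open('oversight.new', 'wb') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    for line in reader:
        # ... apply the same per-line update logic shown earlier ...
        writer.writerow(line)

# then compare before replacing the original, e.g. from a shell:
#   diff oversight.txt oversight.new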

To create a set from your existing dictionary - to some extent it depends on exactly what format trakt_shows_seen uses. Your example shows an entry for one show, but doesn't indicate how it represents more than one show. For now I'm going to assume it's a list of such dictionaries, based on your attempted code.

shows_of_interest = set()
for show_dict in trakt_shows_seen:
    title = show_dict['title']
    for episode_dict in show_dict['episodes']:
        shows_of_interest.add((title, episode_dict['season'], episode_dict['episode']))

Then, in the loop that reads the file:

        # the rest as shown above              
        season_number = int(line[season_index])
        episode_number = int(line[episode_index])
        if (showtitle, season_number, episode_number) in shows_of_interest:
            # line matches a tracked episode
