Recursion, recursion, recursion: how to improve performance? (Python recursive archive extraction)

Question

I am trying to develop a recursive extractor. The problem is that it recurses too much (every time it finds an archive) and takes a performance hit.

So how can I improve the code below?

My Idea 1:

First build a dict of the directory contents, keyed by file type. When an archive is found, extract only that one, then regenerate the dict.

My Idea 2:

os.walk returns a generator. Is there something I can do with generators? I am new to generators.

Here is the current code:

import os, magic
m = magic.open( magic.MAGIC_NONE )
m.load()

archive_type = [ 'gzip compressed data',
        '7-zip archive data',
        'Zip archive data',
        'bzip2 compressed data',
        'tar archive',
        'POSIX tar archive',
        'POSIX tar archive (GNU)',
        'RAR archive data',
        'Microsoft Outlook email folder (>=2003)',
        'Microsoft Outlook email folder']

def extractRecursive( path ,archives):
    i=0
    for dirpath, dirnames, filenames in os.walk( path ):
        for f in filenames:
            fp = os.path.join( dirpath, f )
            i+=1
            print(i)
            file_type = m.file( fp ).split( "," )[0]
            if file_type in archives:
                arcExtract(fp,file_type,path,True)
                extractRecursive(path,archives)
    return "Done"



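# Assumed: pst_types was never defined in the original; taken here to be the
# two Outlook entries from archive_type.
pst_types = ['Microsoft Outlook email folder (>=2003)',
        'Microsoft Outlook email folder']

def arcExtract(file_path,file_type,extracted_path="/home/v3ss/Downloads/extracted",unlink=False):
    import subprocess

    # Build the argument list directly so paths with spaces or quotes survive.
    if file_type in pst_types:
        args = ["readpst", "-o", extracted_path, "-S", file_path]
    else:
        args = ["7z", "-y", "-r", "-o" + extracted_path, "x", file_path]

    print(args)

    try:
        sp = subprocess.Popen( args, shell = False, stdout = subprocess.PIPE, stderr = subprocess.PIPE )
        out, err = sp.communicate()
        print("%s %s" % (out, err))
        ret = sp.returncode
    except OSError as e:
        # The extractor could not be started (for example, it is not installed).
        print("Error no %s  Message %s" % (e.errno, e.strerror))
        return "Failed"

    if ret == 0:
        if unlink:
            os.unlink(file_path)
        return "OK!"
    else:
        return "Failed"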
if __name__ == '__main__':
    extractRecursive( 'Path/To/Archives' ,archive_type)

Answer 1

You can simplify your extractRecursive function to use os.walk the way it is meant to be used. os.walk already descends into all subdirectories, so your own recursion inside the loop is unneeded.

Simply remove the recursive call from inside the loop, re-scan only the directories where something was extracted, and it should work :)

def extractRecursive(path, archives, extracted_archives=None):
    i = 0
    if extracted_archives is None:
        extracted_archives = set()
    # Directories that received newly extracted files during this pass.
    extracted_in = set()

    for dirpath, dirnames, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            i += 1
            print(i)
            file_type = m.file(fp).split(',')[0]
            if file_type in archives and fp not in extracted_archives:
                extracted_archives.add(fp)
                extracted_in.add(dirpath)
                # Extract next to the archive so the re-scan of dirpath sees the new files.
                arcExtract(fp, file_type, dirpath, True)

    # Re-scan only the directories that changed, not the whole tree.
    for changed_dir in extracted_in:
        extractRecursive(changed_dir, archives, extracted_archives)

    return "Done"
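
As a quick check of the claim that one os.walk call already reaches every subdirectory, here is a small self-contained sketch. The helper names list_files and make_demo_tree are made up for illustration and are not part of the question's code; the demo tree is created in a temporary directory.

```python
import os
import tempfile

def list_files(root):
    """Return every file path below root using a single os.walk pass."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

def make_demo_tree():
    """Create a throwaway tree: root/a.txt and root/sub/deeper/b.txt."""
    root = tempfile.mkdtemp()
    nested = os.path.join(root, "sub", "deeper")
    os.makedirs(nested)
    for path in (os.path.join(root, "a.txt"), os.path.join(nested, "b.txt")):
        with open(path, "w") as handle:
            handle.write("demo")
    return root

if __name__ == "__main__":
    demo_root = make_demo_tree()
    # Both files are listed, including the one two levels down.
    for path in list_files(demo_root):
        print(os.path.relpath(path, demo_root))
```

What os.walk does not do is notice files that appear in a directory after it has already listed that directory, which is why some form of re-scan is still needed after an extraction.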

Answer 2 (accepted)

If, as it appears, you want to extract the archive files to paths "above" the one they're in, os.walk by itself (in its normal top-down mode) can't help you: by the time you extract an archive into some directory x, os.walk may well have already visited x, so you only see all the contents by having os.walk scan the whole path over and over. That said, I'm surprised your code ever terminates, since the archive files should keep being found and extracted; I don't see what ends the recursion. (To solve that, it would suffice to keep a set of the paths of all archive files you've already extracted, so you skip them when you meet them again.)

By far the best architecture, though, would be for arcExtract to return a list of all the files it has extracted (specifically their destination paths). Then you could simply keep extending a list with these extracted files during the os.walk loop (no recursion), and afterwards loop over just that list (no need to keep asking the OS about files and directories, which saves a lot of time too), producing a new, similar list each round. No recursion, no redundant work. I imagine that readpst and 7z can supply such lists (maybe on their standard output or standard error, which you currently display but don't process) in some textual form that you could parse into a list...?
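
A minimal sketch of that work-list idea, under stated assumptions: the standard-library zipfile module stands in for 7z and readpst so the example runs anywhere, and the names extract_zip, extract_all and build_nested_zip are hypothetical. Each archive is unpacked into a sibling directory named after it (a choice made here to avoid name clashes, not something the question does), and the extractor reports what it wrote so only those files are queued.

```python
import io
import os
import tempfile
import zipfile
from collections import deque

def extract_zip(archive_path, dest_dir):
    """Stand-in extractor: unpack a zip and return the file paths it wrote."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
        return [os.path.normpath(os.path.join(dest_dir, name))
                for name in zf.namelist() if not name.endswith("/")]

def extract_all(root, is_archive=zipfile.is_zipfile, extract=extract_zip):
    """Extract nested archives under root, walking the tree only once."""
    # Seed the queue with a single os.walk pass over the existing files.
    pending = deque(os.path.join(dirpath, name)
                    for dirpath, _dirs, names in os.walk(root)
                    for name in names)
    seen = set()       # archives already handled, guarantees termination
    extracted = []     # archives unpacked, in processing order
    while pending:
        path = pending.popleft()
        if path in seen or not is_archive(path):
            continue
        seen.add(path)
        # Only the files reported by the extractor are queued; no re-walk.
        pending.extend(extract(path, path + ".extracted"))
        extracted.append(path)
    return extracted

def build_nested_zip(root):
    """Demo helper: write root/outer.zip containing inner.zip with hello.txt."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as zf:
        zf.writestr("hello.txt", "hi")
    outer_path = os.path.join(root, "outer.zip")
    with zipfile.ZipFile(outer_path, "w") as zf:
        zf.writestr("inner.zip", inner.getvalue())
    return outer_path

if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()
    build_nested_zip(demo_root)
    for archive in extract_all(demo_root):
        print(os.path.relpath(archive, demo_root))
```

To use this with the question's tools, extract would wrap arcExtract and return the destination paths, for instance by listing the fresh output directory after the subprocess finishes, and is_archive would wrap the magic type check.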
