
Recursion Recursion Recursion: How can I improve performance? (Python recursive archive extraction)

I am trying to develop a recursive extractor. The problem is that it recurses too much (every time it finds an archive) and takes a performance hit.

So how can I improve the code below?

My Idea 1:

Build a dict of the directories first, together with the file types, using the file types as keys. When an archive is found, extract only that one, then regenerate the archive dict.

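A minimal sketch of Idea 1 might look like this. Note that `detect` is a hypothetical stand-in for the `m.file(fp)` libmagic call, so the sketch groups by whatever string the callable returns:

```python
import os
from collections import defaultdict

def files_by_type(path, detect):
    """Walk *path* once and group every file by its detected type string.

    detect(filepath) -> str is a stand-in for the libmagic lookup; the
    returned dict maps each type string to the list of matching paths.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            groups[detect(fp)].append(fp)
    return dict(groups)
```

With the dict in hand you would extract only the entries under the archive-type keys, then rebuild the dict for the directories that changed.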
My Idea 2:

os.walk returns a generator, so is there something I can do with generators? I am new to generators.

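On Idea 2: since os.walk is a generator, you can build a lazy pipeline that yields only archive paths, without materializing any directory listing up front. Here `is_archive` is a hypothetical predicate standing in for the libmagic type check:

```python
import os

def iter_archives(path, is_archive):
    """Lazily yield the paths of archive files under *path*.

    Nothing is collected into a list; each path is produced on demand,
    so the caller can stop early or interleave extraction with walking.
    """
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            if is_archive(fp):
                yield fp
```

The caller can then drive extraction with a plain `for` loop over the generator instead of recursing.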
Here is the current code:

import os, magic

m = magic.open(magic.MAGIC_NONE)
m.load()

# Outlook PST files need readpst rather than 7z
pst_types = ['Microsoft Outlook email folder (>=2003)',
        'Microsoft Outlook email folder']

archive_type = ['gzip compressed data',
        '7-zip archive data',
        'Zip archive data',
        'bzip2 compressed data',
        'tar archive',
        'POSIX tar archive',
        'POSIX tar archive (GNU)',
        'RAR archive data'] + pst_types

def extractRecursive(path, archives):
    i = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            i += 1
            print i
            file_type = m.file(fp).split(",")[0]
            if file_type in archives:
                arcExtract(fp, file_type, path, True)
                extractRecursive(path, archives)  # re-walks the whole tree
    return "Done"


def arcExtract(file_path, file_type, extracted_path="/home/v3ss/Downloads/extracted", unlink=False):
    import subprocess, shlex

    if file_type in pst_types:
        cmd = "readpst -o '%s' -S '%s'" % (extracted_path, file_path)
    else:
        # the 7z command letter must come before its switches
        cmd = "7z x -y -r -o%s '%s'" % (extracted_path, file_path)

    print cmd
    args = shlex.split(cmd)
    print args

    ret = None
    try:
        sp = subprocess.Popen(args, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = sp.communicate()
        print out, err
        ret = sp.returncode
    except OSError, e:
        print "Error no %s  Message %s" % (e.errno, e.strerror)

    if ret == 0:
        if unlink:
            os.unlink(file_path)
        return "OK!"
    else:
        return "Failed"

if __name__ == '__main__':
    extractRecursive('Path/To/Archives', archive_type)

You can simplify your extractRecursive method to use os.walk as it should be used: os.walk already traverses all subdirectories, so re-walking the whole tree after every single extraction is unneeded.

Keep a set of the archives you have already extracted, and only re-walk the directories that actually received new files :)

def extractRecursive(path, archives, extracted_archives=None):
    i = 0
    if extracted_archives is None:
        extracted_archives = set()
    extracted_in = set()  # directories that received newly extracted files

    for dirpath, dirnames, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            i += 1
            print i
            file_type = m.file(fp).split(',')[0]
            if file_type in archives and fp not in extracted_archives:
                extracted_archives.add(fp)
                extracted_in.add(dirpath)
                arcExtract(fp, file_type, path, True)

    # only revisit directories where something new was extracted
    for path in extracted_in:
        extractRecursive(path, archives, extracted_archives)

    return "Done"

If, as it appears, you want to extract the archive files to paths "above" the one they're in, os.walk per se (in its normal top-down operation) can't help you: by the time you extract an archive into a certain directory x, os.walk may well (though not necessarily) have already visited x, so only by making os.walk re-scan the whole path over and over can you see all the contents. In fact, I'm surprised your code ever terminates, since the archive-type files keep getting found and re-extracted; nothing ever stops the recursion. To fix that, it suffices to keep a set of the paths of archive-type files you've already extracted, and skip them when you meet them again.

By far the best architecture, anyway, would be for arcExtract to return a list of all the files it has extracted (specifically, their destination paths). Then you could simply keep extending a list with those extracted files during a single os.walk loop (no recursion), and afterwards keep looping just over that list, producing a new similar list on each pass; there is no need to keep asking the OS about files and directories, which saves a lot of time on that operation too. No recursion, no redundant work. I imagine that readpst and 7z are able to supply such lists (maybe on their standard output or error, which you currently display but don't process) in some textual form that you could parse into a list...?
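That architecture boils down to a worklist loop, which can be sketched as follows. Here `extract_one` and `is_archive` are hypothetical stand-ins: `extract_one(path)` plays the role of arcExtract returning the paths it produced, and `is_archive(path)` plays the role of the libmagic type check:

```python
from collections import deque

def extract_all(initial_archives, extract_one, is_archive):
    """Breadth-first worklist: no recursion, no re-walking the filesystem.

    extract_one(path) -> list of paths the extraction produced.
    is_archive(path) -> True if the path should itself be extracted.
    Returns the archives in the order they were extracted; the seen-set
    guarantees termination even if an archive shows up twice.
    """
    pending = deque(initial_archives)
    seen = set()
    extracted = []
    while pending:
        arc = pending.popleft()
        if arc in seen:
            continue
        seen.add(arc)
        extracted.append(arc)
        # queue any newly produced archives for extraction in turn
        for produced in extract_one(arc):
            if is_archive(produced):
                pending.append(produced)
    return extracted
```

Because every extracted archive is remembered in `seen`, the loop terminates as soon as no new archives are produced, with each archive processed exactly once.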
