
Most efficient way to determine the size of a directory in Python

os.walk has a helpful example:

import os
from os.path import join, getsize
for root, dirs, files in os.walk('python/Lib/email'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories
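For comparison with the snippet further down, totalling a whole tree with os.walk looks something like this (a sketch; walk_tree_size is just an illustrative name). The hidden cost is that getsize performs a separate os.stat call per file, on top of the directory scan os.walk already did:

```python
import os
from os.path import getsize, join

def walk_tree_size(path):
    """Total size of all regular files under path, using os.walk.

    getsize() triggers an extra os.stat per file, discarding the
    metadata the underlying directory scan already retrieved --
    which is why this is slower than a direct os.scandir loop,
    especially on Windows.
    """
    total = 0
    for root, dirs, files in os.walk(path):
        total += sum(getsize(join(root, name)) for name in files)
    return total
```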

Despite the note that os.walk got faster in Python 3.5 by switching to os.scandir, the example doesn't mention that it's still a sub-optimal implementation on Windows: getsize makes a separate os.stat call per file, even though the directory scan already retrieved that metadata.

https://www.python.org/dev/peps/pep-0471/ does describe this and gets it almost right. However, it recommends using recursion. When dealing with arbitrary folder structures this doesn't work so well, as you'll quickly hit Python's recursion limit (you'll only be able to iterate a folder structure up to about 1000 folders deep, which isn't necessarily unrealistic if you're starting at the root of the filesystem). The real limit isn't actually 1000, either: it's 1000 minus your Python call depth at the point you call the function. If you're doing this in response to a web request through Django with lots of business-logic layers, it wouldn't be hard to get close to that limit.
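For illustration, the recursive shape that PEP 471 suggests looks roughly like this (get_tree_size_recursive is a hypothetical name, not from the PEP verbatim); every level of directory nesting costs one Python stack frame, which is where the recursion limit bites:

```python
import os

def get_tree_size_recursive(path):
    """Recursive variant in the spirit of PEP 471's example.

    Each nested directory adds a stack frame, so a tree deeper than
    roughly sys.getrecursionlimit() minus the current call depth
    raises RecursionError.
    """
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                total += get_tree_size_recursive(entry.path)
            else:
                total += entry.stat(follow_symlinks=False).st_size
    return total
```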

The following snippet should be optimal on all operating systems and handle any folder structure you throw at it. Memory usage will obviously grow with the number of pending folders, but to my knowledge there's nothing you can really do about that, as you somehow have to keep track of where you still need to go.

import os

def get_tree_size(path):
    total_size = 0
    dirs = [path]
    while dirs:
        next_dir = dirs.pop()
        with os.scandir(next_dir) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size

It's possible that using a collections.deque may speed up operations versus the naive use of a list here, but I suspect it would be hard to write a benchmark that shows it, with disk speeds what they are today.
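For the record, the deque variant would look like this (a sketch; get_tree_size_deque is just an illustrative name). Since both list and deque give O(1) append and pop from the same end, any difference is a constant factor that disk I/O will almost certainly swamp:

```python
import os
from collections import deque

def get_tree_size_deque(path):
    """Same iterative algorithm, with a deque as the work stack.

    deque.append()/pop() operate on the right end, mirroring the
    list-based version, so traversal order is identical.
    """
    total_size = 0
    dirs = deque([path])
    while dirs:
        with os.scandir(dirs.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size
```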
